APM (Applied Predictive Modeling)

A practical, integrated treatment of supervised learning that exercises over 50 kinds of models and a variety of evaluation metrics:

  Applied Predictive Modeling
  M. Kuhn and K. Johnson
  Springer-Verlag, 2013.
  ISBN: 978-1-4614-6848-6 (Print)

http://link.springer.com/book/10.1007%2F978-1-4614-6849-3

[APM] is similar to [ISL] and [ESL] but emphasizes practical model development and evaluation, including case studies with complete R scripts.

Its companion caret package (http://caret.r-forge.r-project.org) provides a uniform interface to the R packages that implement supervised learning, wrapping mainstream model-evaluation methods and well over 100 popular models (180 at the version tabulated below).

caret package manual (PDF): http://cran.r-project.org/web/packages/caret/caret.pdf

List of models in caret (reproduced as a table below): http://caret.r-forge.r-project.org/modelList.html

caret package overview: http://www.jstatsoft.org/v28/i05/paper
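caret's central idea is a single train() entry point that dispatches on a method-name string to whichever underlying package implements that model. A miniature of that design in Python (hypothetical registry and toy "models"; not caret's actual code):

```python
# Two trivial model constructors standing in for the packages caret wraps.
def fit_mean(X, y):
    # "nullModel": always predict the mean of the training response
    m = sum(y) / len(y)
    return lambda X_new: [m] * len(X_new)

def fit_knn1(X, y):
    # 1-nearest-neighbour on a 1-D predictor
    pairs = sorted(zip(X, y))
    def predict(X_new):
        return [min(pairs, key=lambda p: abs(p[0] - x))[1] for x in X_new]
    return predict

# One registry of named constructors behind a uniform train() call,
# analogous to caret dispatching on method = "rf", "pls", "knn", ...
MODELS = {"nullModel": fit_mean, "knn": fit_knn1}

def train(method, X, y):
    return MODELS[method](X, y)

model = train("knn", X=[1, 2, 10], y=[0, 0, 1])
print(model([9]))  # -> [1]
```

The uniform interface is what lets the [APM] scripts swap among dozens of models by changing only the method string.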

In [31]:
%load_ext rpy2.ipython
In [32]:
%%R

not.installed <- function(pkg) !is.element(pkg, installed.packages()[,1])
In [33]:
%%R

if (not.installed("caret")) install.packages("caret")

library(caret)

library(help=caret)
Documentation for package 'caret'
		Information on package 'caret'

Description:

Package:            caret
Version:            6.0-41
Date:               2015-01-02
Title:              Classification and Regression Training
Author:             Max Kuhn. Contributions from Jed Wing, Steve
                    Weston, Andre Williams, Chris Keefer, Allan
                    Engelhardt, Tony Cooper, Zachary Mayer, Brenton
                    Kenkel, the R Core Team, Michael Benesty, Reynald
                    Lescarbeau, Andrew Ziem, and Luca Scrucca.
Description:        Misc functions for training and plotting
                    classification and regression models
Maintainer:         Max Kuhn <Max.Kuhn@pfizer.com>
Depends:            R (>= 2.10), stats, lattice (>= 0.20), ggplot2
URL:                http://caret.r-forge.r-project.org/
Imports:            car, reshape2, foreach, methods, plyr, nlme,
                    BradleyTerry2
Suggests:           e1071, earth (>= 2.2-3), fastICA, gam, ipred,
                    kernlab, klaR, MASS, ellipse, mda, mgcv, mlbench,
                    nnet, party (>= 0.9-99992), pls, pROC, proxy,
                    randomForest, RANN, spls, subselect, pamr, superpc,
                    Cubist, testthat (>= 0.9.1)
License:            GPL (>= 2)
NeedsCompilation:   yes
Packaged:           2015-01-02 18:40:43 UTC; kuhna03
Repository:         CRAN
Date/Publication:   2015-01-03 06:58:41
Built:              R 3.1.2; x86_64-apple-darwin13.4.0; 2015-01-04
                    06:10:33 UTC; unix

Index:

BloodBrain              Blood Brain Barrier Data
BoxCoxTrans.default     Box-Cox and Exponential Transformations
GermanCredit            German Credit Data
as.table.confusionMatrix
                        Save Confusion Table Results
avNNet.default          Neural Networks Using Model Averaging
bag.default             A General Framework For Bagging
bagEarth                Bagged Earth
bagFDA                  Bagged FDA
calibration             Probability Calibration Plot
caretFuncs              Backwards Feature Selection Helper Functions
caretSBF                Selection By Filtering (SBF) Helper Functions
cars                    Kelly Blue Book resale data for 2005 model year
                        GM cars
classDist               Compute and predict the distances to class
                        centroids
confusionMatrix         Create a confusion matrix
confusionMatrix.train   Estimate a Resampled Confusion Matrix
cox2                    COX-2 Activity Data
createDataPartition     Data Splitting functions
dhfr                    Dihydrofolate Reductase Inhibitors Data
diff.resamples          Inferential Assessments About Model Performance
dotPlot                 Create a dotplot of variable importance values
dotplot.diff.resamples
                        Lattice Functions for Visualizing Resampling
                        Differences
downSample              Down- and Up-Sampling Imbalanced Data
dummyVars               Create A Full Set of Dummy Variables
featurePlot             Wrapper for Lattice Plotting of Predictor
                        Variables
filterVarImp            Calculation of filter-based variable importance
findCorrelation         Determine highly correlated variables
findLinearCombos        Determine linear combinations in a matrix
format.bagEarth         Format 'bagEarth' objects
gafs.default            Genetic algorithm feature selection
gafs_initial            Ancillary genetic algorithm functions
histogram.train         Lattice functions for plotting resampling
                        results
icr.formula             Independent Component Regression
index2vec               Convert indicies to a binary vector
knn3                    k-Nearest Neighbour Classification
knnreg                  k-Nearest Neighbour Regression
lift                    Lift Plot
maxDissim               Maximum Dissimilarity Sampling
mdrr                    Multidrug Resistance Reversal (MDRR) Agent Data
modelLookup             Tools for Models Available in 'train'
nearZeroVar             Identification of near zero variance predictors
nullModel               Fit a simple, non-informative model
oil                     Fatty acid composition of commercial oils
oneSE                   Selecting tuning Parameters
panel.lift2             Lattice Panel Functions for Lift Plots
panel.needle            Needle Plot Lattice Panel
pcaNNet.default         Neural Networks with a Principal Component Step
plot.gafs               Plot Method for the gafs and safs Classes
plot.rfe                Plot RFE Performance Profiles
plot.train              Plot Method for the train Class
plot.varImp.train       Plotting variable importance measures
plotClassProbs          Plot Predicted Probabilities in Classification
                        Models
plotObsVsPred           Plot Observed versus Predicted Results in
                        Regression and Classification Models
plsda                   Partial Least Squares and Sparse Partial Least
                        Squares Discriminant Analysis
postResample            Calculates performance across resamples
pottery                 Pottery from Pre-Classical Sites in Italy
prcomp.resamples        Principal Components Analysis of Resampling
                        Results
preProcess              Pre-Processing of Predictors
predict.bagEarth        Predicted values based on bagged Earth and FDA
                        models
predict.gafs            Predict new samples
predict.knn3            Predictions from k-Nearest Neighbors
predict.knnreg          Predictions from k-Nearest Neighbors Regression
                        Model
predict.train           Extract predictions and class probabilities
                        from train objects
predictors              List predictors used in the model
print.confusionMatrix   Print method for confusionMatrix
print.train             Print Method for the train Class
resampleHist            Plot the resampling distribution of the model
                        statistics
resampleSummary         Summary of resampled performance estimates
resamples               Collation and Visualization of Resampling
                        Results
rfe                     Backwards Feature Selection
rfeControl              Controlling the Feature Selection Algorithms
safs.default            Simulated annealing feature selection
safsControl             Control parameters for GA and SA feature
                        selection
safs_initial            Ancillary simulated annealing functions
sbf                     Selection By Filtering (SBF)
sbfControl              Control Object for Selection By Filtering (SBF)
segmentationData        Cell Body Segmentation
sensitivity             Calculate sensitivity, specificity and
                        predictive values
spatialSign             Compute the multivariate spatial sign
summary.bagEarth        Summarize a bagged earth or FDA fit
tecator                 Fat, Water and Protein Content of Meat Samples
train                   Fit Predictive Models over Different Tuning
                        Parameters
trainControl            Control parameters for train
train_model_list        A List of Available Models in train
twoClassSim             Simulation Functions
update.safs             Update or Re-fit a SA or GA Model
update.train            Update or Re-fit a Model
varImp                  Calculation of variable importance for
                        regression and classification models
varImp.gafs             Variable importances for GAs and SAs
xyplot.resamples        Lattice Functions for Visualizing Resampling
                        Results
xyplot.rfe              Lattice functions for plotting resampling
                        results of recursive feature selection

Further information is available in the following vignettes in
directory
'/Library/Frameworks/R.framework/Versions/3.1/Resources/library/caret/doc':

caret: A Short Introduction to the caret Package (source, pdf)
Loading required package: lattice
Loading required package: ggplot2

[APM] includes examples using the following packages/models:

C5.0, J48, M5, Nelder-Mead, PART, avNNet, cforest, ctree, cubist, earth, enet, fda, gbm, glm, glmnet, knn, lda, lm, mda, nb, nnet, pam, pcr, pls, rf, ridge, rpart, sparseLDA, svmPoly, svmRadial, treebag

Installing and getting R example scripts:

        install.packages("AppliedPredictiveModeling")
        library(AppliedPredictiveModeling)

        getPackages(1:19)  # download ALL packages used in chs 1-19, including caret
In [34]:
%%R

if (not.installed("AppliedPredictiveModeling")) {
    
    install.packages("AppliedPredictiveModeling")
    library(AppliedPredictiveModeling)
    
    for (chapter in c(2,3,4,6,7,8,10, 11,12,13,14,16,17,19))  getPackages(chapter)

} else {

    library(AppliedPredictiveModeling)

}

library(help=AppliedPredictiveModeling)
Documentation for package 'AppliedPredictiveModeling'
		Information on package 'AppliedPredictiveModeling'

Description:

Package:            AppliedPredictiveModeling
Type:               Package
Title:              Functions and Data Sets for 'Applied Predictive
                    Modeling'
Version:            1.1-6
Date:               2014-07-24
Author:             Max Kuhn, Kjell Johnson
Maintainer:         Max Kuhn <mxkuhn@gmail.com>
Description:        A few functions and several data set for the
                    Springer book 'Applied Predictive Modeling'
URL:                http://appliedpredictivemodeling.com/
Depends:            R (>= 2.10)
Imports:            CORElearn, MASS, plyr, reshape2
Suggests:           caret (>= 6.0-22), lattice, ellipse
License:            GPL
Packaged:           2014-07-25 13:37:54 UTC; kuhna03
NeedsCompilation:   no
Repository:         CRAN
Date/Publication:   2014-07-25 18:42:22
Built:              R 3.1.2; ; 2015-01-10 01:37:59 UTC; unix

Index:

AppliedPredictiveModeling-package
                        Data, Functions and Scripts for
                        'scriptLocation'
ChemicalManufacturingProcess
                        Chemical Manufacturing Process Data
abalone                 Abalone Data
bio                     Hepatic Injury Data
bookTheme               Lattice Themes
cars2010                Fuel Economy Data
concrete                Compressive Strength of Concrete from Yeh
                        (1998)
diagnosis               Alzheimer's Disease CSF Data
getPackages             Install Packages for Each Chapter
logisticCreditPredictions
                        Logistic Regression Predictions for the Credit
                        Data
permeability            Permeability Data
permuteRelief           Permutation Statistics for the Relief Algorithm
quadBoundaryFunc        Functions for Simulating Data
schedulingData          HPC Job Scheduling Data
scriptLocation          Find Chapter Script Files
segmentationOriginal    Cell Body Segmentation
trainX                  Solubility Data
twoClassData            Two Class Example Data
In [35]:
%%R

# Grid Search is often used in APM to search a model's parameter space, and
# some chapters use the "doMC" package to do Multi-Core computation
# (supported only on Linux or MacOS):

if (not.installed("doMC"))  install.packages("doMC")   # multicore computation in R

library(doMC)
library(help=doMC)
Documentation for package 'doMC'
		Information on package 'doMC'

Description:

Package:                            doMC
Type:                               Package
Title:                              Foreach parallel adaptor for the
                                    multicore package
Version:                            1.3.3
Author:                             Revolution Analytics
Maintainer:                         Revolution Analytics
                                    <packages@revolutionanalytics.com>
Description:                        Provides a parallel backend for the
                                    %dopar% function using the
                                    multicore functionality of the
                                    parallel package..
Depends:                            R (>= 2.14.0), foreach(>= 1.2.0),
                                    iterators(>= 1.0.0), parallel
Imports:                            utils
Enhances:                           compiler, RUnit
License:                            GPL-2
Repository:                         CRAN
Repository/R-Forge/Project:         domc
Repository/R-Forge/Revision:        16
Repository/R-Forge/DateTimeStamp:   2014-02-25 19:29:46
Date/Publication:                   2014-02-28 07:00:48
Packaged:                           2014-02-25 23:42:04 UTC; rforge
NeedsCompilation:                   no
OS_type:                            unix
Built:                              R 3.1.2; ; 2015-01-09 23:21:34 UTC;
                                    unix

Index:

doMC-package            The doMC Package
registerDoMC            registerDoMC

Further information is available in the following vignettes in
directory
'/Library/Frameworks/R.framework/Versions/3.1/Resources/library/doMC/doc':

gettingstartedMC: Getting Started with doMC and foreach (source, pdf)
Loading required package: foreach
foreach: simple, scalable parallel programming from Revolution Analytics
Use Revolution R for scalability, fault tolerance and more.
http://www.revolutionanalytics.com
Loading required package: iterators
Loading required package: parallel
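Grid search, as used throughout [APM], simply evaluates every combination of tuning-parameter values and keeps the best-scoring one; doMC's contribution is parallelizing that loop over cores via %dopar%. A minimal sequential sketch in Python (hypothetical parameter names and a toy deterministic scoring function; caret's train() does this, plus resampling, for you):

```python
from itertools import product

# Hypothetical tuning grid, e.g. for an SVM with a radial kernel
grid = {"sigma": [0.01, 0.1, 1.0], "C": [1, 10, 100]}

def toy_score(params):
    # Stand-in for "fit the model, estimate performance by resampling";
    # deterministic here so the example is reproducible.
    return -abs(params["sigma"] - 0.1) - abs(params["C"] - 10) / 100

def grid_search(grid, score):
    keys = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[k] for k in keys)):  # every combination
        params = dict(zip(keys, values))
        s = score(params)
        if s > best_score:
            best_params, best_score = params, s
    return best_params

print(grid_search(grid, toy_score))  # each grid cell could be scored on its own core
```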
In [36]:
import pandas as pd
pd.set_option('display.max_rows', 500)

ModelDF = pd.read_table("AppliedPredictiveModelingModelTable.tsv")

print(ModelDF.describe())

# print(ModelDF.columns)
#
# colnames = list(ModelDF)
# print(colnames)
# 
# print(ModelDF.ix[:,0:2])  # print first two columns

ModelDF
                        Model method Argument Value            Type Packages  \
count                     180                   180             180      176   
unique                    159                   180               3       88   
top     Partial Least Squares              logicBag  Classification  kernlab   
freq                        4                     1              72       17   

       Tuning Parameters  
count                180  
unique               115  
top                 None  
freq                  24  
Out[36]:
Model method Argument Value Type Packages Tuning Parameters
0 Boosted Classification Trees ada Classification ada, plyr iter, maxdepth, nu
1 Bagged AdaBoost AdaBag Classification adabag, plyr mfinal, maxdepth
2 AdaBoost.M1 AdaBoost.M1 Classification adabag, plyr mfinal, maxdepth, coeflearn
3 Adaptive Mixture Discriminant Analysis amdai Classification adaptDA model
4 Adaptive-Network-Based Fuzzy Inference System ANFIS Regression frbs num.labels, max.iter
5 Model Averaged Neural Network avNNet Dual Use nnet size, decay, bag
6 Bagged Model bag Dual Use caret vars
7 Bagged MARS bagEarth Dual Use earth nprune, degree
8 Bagged MARS using gCV Pruning bagEarthGCV Dual Use earth degree
9 Bagged Flexible Discriminant Analysis bagFDA Classification earth, mda degree, nprune
10 Bagged FDA using gCV Pruning bagFDAGCV Classification earth degree
11 Bayesian Generalized Linear Model bayesglm Dual Use arm None
12 Self-Organizing Map bdk Dual Use kohonen xdim, ydim, xweight, topo
13 Binary Discriminant Analysis binda Classification binda lambda.freqs
14 Boosted Tree blackboost Dual Use party, mboost, plyr mstop, maxdepth
15 Random Forest with Additional Feature Selection Boruta Dual Use Boruta, randomForest mtry
16 Bayesian Regularized Neural Networks brnn Regression brnn neurons
17 Boosted Linear Model bstLs Dual Use bst, plyr mstop, nu
18 Boosted Smoothing Spline bstSm Dual Use bst, plyr mstop, nu
19 Boosted Tree bstTree Dual Use bst, plyr mstop, maxdepth, nu
20 C5.0 C5.0 Classification C50, plyr trials, model, winnow
21 Cost-Sensitive C5.0 C5.0Cost Classification C50, plyr trials, model, winnow, cost
22 Single C5.0 Ruleset C5.0Rules Classification C50 None
23 Single C5.0 Tree C5.0Tree Classification C50 None
24 Conditional Inference Random Forest cforest Dual Use party mtry
25 SIMCA CSimca Classification rrcovHD None
26 Conditional Inference Tree ctree Dual Use party mincriterion
27 Conditional Inference Tree ctree2 Dual Use party maxdepth
28 Cubist cubist Regression Cubist committees, neighbors
29 Dynamic Evolving Neural-Fuzzy Inference System DENFIS Regression frbs Dthr, max.iter
30 Stacked AutoEncoder Deep Neural Network dnn Dual Use deepnet layer1, layer2, layer3, hidden_dropout, visibl...
31 Multivariate Adaptive Regression Spline earth Dual Use earth nprune, degree
32 Extreme Learning Machine elm Dual Use elmNN nhid, actfun
33 Elasticnet enet Regression elasticnet fraction, lambda
34 Ensemble Partial Least Squares Regression with... enpls.fs Regression enpls maxcomp, threshold
35 Ensemble Partial Least Squares Regression enpls Regression enpls maxcomp
36 Tree Models from Genetic Algorithms evtree Dual Use evtree alpha
37 Random Forest by Randomization extraTrees Dual Use extraTrees mtry, numRandomCuts
38 Flexible Discriminant Analysis fda Classification earth, mda degree, nprune
39 Fuzzy Rules Using Genetic Cooperative-Competit... FH.GBML Classification frbs max.num.rule, popu.size, max.gen
40 Fuzzy Inference Rules by Descent Method FIR.DM Regression frbs num.labels, max.iter
41 Ridge Regression with Variable Selection foba Regression foba k, lambda
42 Fuzzy Rules Using Chi's Method FRBCS.CHI Classification frbs num.labels, type.mf
43 Fuzzy Rules with Weight Factor FRBCS.W Classification frbs num.labels, type.mf
44 Simplified TSK Fuzzy Rules FS.HGD Regression frbs num.labels, max.iter
45 Generalized Additive Model using Splines gam Dual Use mgcv select, method
46 Boosted Generalized Additive Model gamboost Dual Use mboost mstop, prune
47 Generalized Additive Model using LOESS gamLoess Dual Use gam span, degree
48 Generalized Additive Model using Splines gamSpline Dual Use gam df
49 Gaussian Process gaussprLinear Dual Use kernlab None
50 Gaussian Process with Polynomial Kernel gaussprPoly Dual Use kernlab degree, scale
51 Gaussian Process with Radial Basis Function Ke... gaussprRadial Dual Use kernlab sigma
52 Stochastic Gradient Boosting gbm Dual Use gbm, plyr n.trees, interaction.depth, shrinkage
53 Multivariate Adaptive Regression Splines gcvEarth Dual Use earth degree
54 Fuzzy Rules via MOGUL GFS.FR.MOGAL Regression frbs max.gen, max.iter, max.tune
55 Fuzzy Rules Using Genetic Cooperative-Competit... GFS.GCCL Classification frbs num.labels, popu.size, max.gen
56 Genetic Lateral Tuning and Rule Selection of L... GFS.LT.RS Regression frbs popu.size, num.labels, max.gen
57 Fuzzy Rules via Thrift GFS.THRIFT Regression frbs popu.size, num.labels, max.gen
58 Generalized Linear Model glm Dual Use NaN None
59 Boosted Generalized Linear Model glmboost Dual Use mboost mstop, prune
60 glmnet glmnet Dual Use glmnet alpha, lambda
61 Generalized Linear Model with Stepwise Feature... glmStepAIC Dual Use MASS None
62 Generalized Partial Least Squares gpls Classification gpls K.prov
63 Heteroscedastic Discriminant Analysis hda Classification hda gamma, lambda, newdim
64 High Dimensional Discriminant Analysis hdda Classification HDclassif threshold, model
65 Hybrid Neural Fuzzy Inference System HYFIS Regression frbs num.labels, max.iter
66 Independent Component Regression icr Regression fastICA n.comp
67 C4.5-like Trees J48 Classification RWeka C
68 Rule-Based Classifier JRip Classification RWeka NumOpt
69 Partial Least Squares kernelpls Dual Use pls ncomp
70 k-Nearest Neighbors kknn Dual Use kknn kmax, distance, kernel
71 k-Nearest Neighbors knn Dual Use NaN k
72 Polynomial Kernel Regularized Least Squares krlsPoly Regression KRLS lambda, degree
73 Radial Basis Function Kernel Regularized Least... krlsRadial Regression KRLS, kernlab lambda, sigma
74 Least Angle Regression lars Regression lars fraction
75 Least Angle Regression lars2 Regression lars step
76 The lasso lasso Regression elasticnet fraction
77 Linear Discriminant Analysis lda Classification MASS None
78 Linear Discriminant Analysis lda2 Classification MASS dimen
79 Linear Regression with Backwards Selection leapBackward Regression leaps nvmax
80 Linear Regression with Forward Selection leapForward Regression leaps nvmax
81 Linear Regression with Stepwise Selection leapSeq Regression leaps nvmax
82 Robust Linear Discriminant Analysis Linda Classification rrcov None
83 Linear Regression lm Regression NaN None
84 Linear Regression with Stepwise Selection lmStepAIC Regression MASS None
85 Logistic Model Trees LMT Classification RWeka iter
86 Bagged Logic Regression logicBag Dual Use logicFS nleaves, ntrees
87 Boosted Logistic Regression LogitBoost Classification caTools nIter
88 Logic Regression logreg Dual Use LogicReg treesize, ntrees
89 Least Squares Support Vector Machine lssvmLinear Classification kernlab None
90 Least Squares Support Vector Machine with Poly... lssvmPoly Classification kernlab degree, scale
91 Least Squares Support Vector Machine with Radi... lssvmRadial Classification kernlab sigma
92 Learning Vector Quantization lvq Classification class size, k
93 Model Tree M5 Regression RWeka pruned, smoothed, rules
94 Model Rules M5Rules Regression RWeka pruned, smoothed
95 Mixture Discriminant Analysis mda Classification mda subclasses
96 Maximum Uncertainty Linear Discriminant Analysis Mlda Classification HiDimDA None
97 Multi-Layer Perceptron mlp Dual Use RSNNS size
98 Multi-Layer Perceptron mlpWeightDecay Dual Use RSNNS size, decay
99 Penalized Multinomial Regression multinom Classification nnet decay
100 Naive Bayes nb Classification klaR fL, usekernel
101 Neural Network neuralnet Regression neuralnet layer1, layer2, layer3
102 Neural Network nnet Dual Use nnet size, decay
103 Tree-Based Ensembles nodeHarvest Dual Use nodeHarvest maxinter, mode
104 Oblique Trees oblique.tree Classification oblique.tree oblique.splits, variable.selection
105 Single Rule Classification OneR Classification RWeka None
106 Oblique Random Forest ORFlog Classification obliqueRF mtry
107 Oblique Random Forest ORFpls Classification obliqueRF mtry
108 Oblique Random Forest ORFridge Classification obliqueRF mtry
109 Oblique Random Forest ORFsvm Classification obliqueRF mtry
110 Nearest Shrunken Centroids pam Classification pamr threshold
111 Parallel Random Forest parRF Dual Use randomForest mtry
112 Rule-Based Classifier PART Classification RWeka threshold, pruned
113 partDSA partDSA Dual Use partDSA cut.off.growth, MPD
114 Neural Networks with Feature Extraction pcaNNet Dual Use nnet size, decay
115 Principal Component Analysis pcr Regression pls ncomp
116 Penalized Discriminant Analysis pda Classification mda lambda
117 Penalized Discriminant Analysis pda2 Classification mda df
118 Penalized Linear Regression penalized Regression penalized lambda1, lambda2
119 Penalized Linear Discriminant Analysis PenalizedLDA Classification penalizedLDA, plyr lambda, K
120 Penalized Logistic Regression plr Classification stepPlr lambda, cp
121 Partial Least Squares pls Dual Use pls ncomp
122 Partial Least Squares Generalized Linear Models plsRglm Dual Use plsRglm nt, alpha.pvals.expli
123 Ordered Logistic or Probit Regression polr Classification MASS None
124 Projection Pursuit Regression ppr Regression NaN nterms
125 Greedy Prototype Selection protoclass Classification proxy, protoclass eps, Minkowski
126 Quadratic Discriminant Analysis qda Classification MASS None
127 Robust Quadratic Discriminant Analysis QdaCov Classification rrcov None
128 Quantile Random Forest qrf Regression quantregForest mtry
129 Quantile Regression Neural Network qrnn Regression qrnn n.hidden, penalty, bag
130 Radial Basis Function Network rbf Classification RSNNS size
131 Radial Basis Function Network rbfDDA Dual Use RSNNS negativeThreshold
132 Regularized Discriminant Analysis rda Classification klaR gamma, lambda
133 Relaxed Lasso relaxo Regression relaxo, plyr lambda, phi
134 Random Forest rf Dual Use randomForest mtry
135 Random Ferns rFerns Classification rFerns depth
136 Factor-Based Linear Discriminant Analysis RFlda Classification HiDimDA q
137 Ridge Regression ridge Regression elasticnet lambda
138 Random k-Nearest Neighbors rknn Dual Use rknn k, mtry
139 Random k-Nearest Neighbors with Feature Selection rknnBel Dual Use rknn, plyr k, mtry, d
140 Robust Linear Model rlm Regression MASS None
141 Robust Mixture Discriminant Analysis rmda Classification robustDA K, model
142 ROC-Based Classifier rocc Classification rocc xgenes
143 CART rpart Dual Use rpart cp
144 CART rpart2 Dual Use rpart maxdepth
145 Cost-Sensitive CART rpartCost Classification rpart cp, Cost
146 Regularized Random Forest RRF Dual Use randomForest, RRF mtry, coefReg, coefImp
147 Regularized Random Forest RRFglobal Dual Use RRF mtry, coefReg
148 Robust Regularized Linear Discriminant Analysis rrlda Classification rrlda lambda, hp, penalty
149 Robust SIMCA RSimca Classification rrcovHD None
150 Relevance Vector Machines with Linear Kernel rvmLinear Regression kernlab None
151 Relevance Vector Machines with Polynomial Kernel rvmPoly Regression kernlab scale, degree
152 Relevance Vector Machines with Radial Basis Fu... rvmRadial Regression kernlab sigma
153 Subtractive Clustering and Fuzzy c-Means Rules SBC Regression frbs r.a, eps.high, eps.low
154 Shrinkage Discriminant Analysis sda Classification sda diagonal, lambda
155 Stepwise Diagonal Linear Discriminant Analysis sddaLDA Classification SDDA None
156 Stepwise Diagonal Quadratic Discriminant Analysis sddaQDA Classification SDDA None
157 Partial Least Squares simpls Dual Use pls ncomp
158 Fuzzy Rules Using the Structural Learning Algo... SLAVE Classification frbs num.labels, max.iter, max.gen
159 Stabilized Linear Discriminant Analysis slda Classification ipred None
160 Sparse Mixture Discriminant Analysis smda Classification sparseLDA NumVars, lambda, R
161 Sparse Linear Discriminant Analysis sparseLDA Classification sparseLDA NumVars, lambda
162 Sparse Partial Least Squares spls Dual Use spls K, eta, kappa
163 Linear Discriminant Analysis with Stepwise Fea... stepLDA Classification klaR, MASS maxvar, direction
164 Quadratic Discriminant Analysis with Stepwise ... stepQDA Classification klaR, MASS maxvar, direction
165 Supervised Principal Component Analysis superpc Regression superpc threshold, n.components
166 Support Vector Machines with Boundrange String... svmBoundrangeString Dual Use kernlab length, C
167 Support Vector Machines with Exponential Strin... svmExpoString Dual Use kernlab lambda, C
168 Support Vector Machines with Linear Kernel svmLinear Dual Use kernlab C
169 Support Vector Machines with Polynomial Kernel svmPoly Dual Use kernlab degree, scale, C
170 Support Vector Machines with Radial Basis Func... svmRadial Dual Use kernlab sigma, C
171 Support Vector Machines with Radial Basis Func... svmRadialCost Dual Use kernlab C
172 Support Vector Machines with Class Weights svmRadialWeights Classification kernlab sigma, C, Weight
173 Support Vector Machines with Spectrum String K... svmSpectrumString Dual Use kernlab length, C
174 Bagged CART treebag Dual Use ipred, plyr None
175 Variational Bayesian Multinomial Probit Regres... vbmpRadial Classification vbmp estimateTheta
176 Partial Least Squares widekernelpls Dual Use pls ncomp
177 Wang and Mendel Fuzzy Rules WM Regression frbs num.labels, type.mf
178 Weighted Subspace Random Forest wsrf Classification wsrf mtry
179 Self-Organizing Maps xyf Dual Use kohonen xdim, ydim, xweight, topo
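With the model table loaded, questions like "how many methods handle both tasks?" or "which methods pull in a given package?" become one-liners. A dependency-free Python sketch (a handful of rows inlined so it stands alone; the real ModelDF has 180 rows):

```python
from collections import Counter

# A few (method, type, packages) rows copied from the table above;
# the full table read into ModelDF has 180 rows.
models = [
    ("gbm",       "Dual Use",       "gbm, plyr"),
    ("rf",        "Dual Use",       "randomForest"),
    ("cubist",    "Regression",     "Cubist"),
    ("lda",       "Classification", "MASS"),
    ("svmRadial", "Dual Use",       "kernlab"),
]

# Count methods by task type
by_type = Counter(task for _, task, _ in models)
print(by_type)

# Which of these methods depend on plyr?
needs_plyr = [m for m, _, pkgs in models if "plyr" in pkgs]
print(needs_plyr)  # -> ['gbm']
```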

Chapters use the following models:

02_A_Short_Tour.R           lm, earth
04_Over_Fitting.R           svmRadial, glm
06_Linear_Regression.R      lm, pls, pcr, ridge, enet
07_Non-Linear_Reg.R         avNNet, earth, svmRadial, svmPoly, knn
08_Regression_Trees.R       rpart, ctree, M5, treebag, rf, cforest, gbm
10_Case_Study_Concrete.R    lm, pls, enet, earth, svmRadial, avNNet, rpart,
                            treebag, ctree, rf, gbm, cubist, M5, Nelder-Mead
11_Class_Performance.R      glm
12_Discriminant_Analysis.R  svmRadial, glm, lda, pls, glmnet, pam
13_Non-Linear_Class.R       mda, nnet, avNNet, fda, svmRadial, svmPoly, knn, nb
14_Class_Trees.R            rpart, J48, PART, treebag, rf, gbm, C5.0
16_Class_Imbalance.R        rf, glm, fda, svmRadial, rpart, C5.0
17_Job_Scheduling.R         rpart, lda, sparseLDA, nnet, pls, fda, rf, C5.0,
                            treebag, svmRadial
19_Feature_Select.R         rf, lda, svmRadial, nb, glm, knn, svmRadial, knn

Training control methods used by the scripts:

04_Over_Fitting.R           repeatedcv, cv, LOOCV, LGOCV, boot, boot632
06_Linear_Regression.R      cv
07_Non-Linear_Reg.R         cv
08_Regression_Trees.R       cv, oob
10_Case_Study_Concrete.R    repeatedcv
11_Class_Performance.R      repeatedcv
12_Discriminant_Analysis.R  cv, LGOCV
13_Non-Linear_Class.R       LGOCV
14_Class_Trees.R            LGOCV
16_Class_Imbalance.R        cv
17_Job_Scheduling.R         repeatedcv
19_Feature_Select.R         repeatedcv, cv
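These trainControl methods differ mainly in how many model fits they cost per candidate tuning setting: k-fold cv fits k models, repeatedcv fits k × repeats, LOOCV fits one per sample, and boot/LGOCV fit one per resample. A rough cost calculator in Python (the default counts here are illustrative assumptions, not caret's actual defaults):

```python
def n_fits(method, n_samples, k=10, repeats=5, resamples=25):
    """Approximate number of model fits per candidate tuning setting."""
    if method == "cv":
        return k                    # k-fold cross-validation
    if method == "repeatedcv":
        return k * repeats          # repeated k-fold
    if method == "LOOCV":
        return n_samples            # leave-one-out: one fit per sample
    if method in ("LGOCV", "boot", "boot632"):
        return resamples            # one fit per random split / bootstrap
    raise ValueError(f"unknown method: {method}")

# e.g. for a 1107-sample data set like cars2010:
for m in ("cv", "repeatedcv", "LOOCV", "boot"):
    print(m, n_fits(m, n_samples=1107))
```

This is why the case-study chapters, which tune many models over large grids, lean on repeatedcv plus doMC rather than LOOCV.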

Print the script from any [APM] Chapter

In [37]:
%%R

APMchapters = c(
"",
"02_A_Short_Tour.R",
"03_Data_Pre_Processing.R",
"04_Over_Fitting.R",
"",
"06_Linear_Regression.R",
"07_Non-Linear_Reg.R",
"08_Regression_Trees.R",
"",
"10_Case_Study_Concrete.R",
"11_Class_Performance.R",
"12_Discriminant_Analysis.R",
"13_Non-Linear_Class.R",
"14_Class_Trees.R",
"",
"16_Class_Imbalance.R",
"17_Job_Scheduling.R",
"18_Importance.R",
"19_Feature_Select.R",
"CreateGrantData.R")

showChapterScript <- function(n) {
  # Display the R script for chapter n (chapters without scripts are skipped)
  if (APMchapters[n] != "")
    file.show( file.path( scriptLocation(), APMchapters[n] ))
}

showChapterOutput <- function(n) {
  # Display the saved output ("<script>.Rout") for chapter n
  if (APMchapters[n] != "")
    file.show( file.path( scriptLocation(), paste(APMchapters[n], "out", sep="") ))
}

runChapterScript <- function(n) {
  # Source (run) the chapter script, echoing each command
  if (APMchapters[n] != "")
    source( file.path( scriptLocation(), APMchapters[n] ),  echo=TRUE )
}
In [41]:
%%R

showChapterScript(2)
NULL
In [42]:
%%R

# showChapterOutput(2)
NULL
In [43]:
%%R -w 600 -h 600

runChapterScript(2)

##     user  system elapsed 
##    4.971   0.114   5.292
NULL
In [39]:
%%R

# Another way to run the script for Chapter 2:

PATIENT = TRUE

if (PATIENT) {
   current_working_directory = getwd()  # remember current directory

   chapter_code_directory = scriptLocation()

   setwd( chapter_code_directory )
   print(dir())

   print(source("02_A_Short_Tour.R", echo=TRUE))

   setwd(current_working_directory)  # return to working directory
}
 [1] "02_A_Short_Tour.R"             "02_A_Short_Tour.Rout"         
 [3] "03_Data_Pre_Processing.R"      "03_Data_Pre_Processing.Rout"  
 [5] "04_Over_Fitting.R"             "04_Over_Fitting.Rout"         
 [7] "06_Linear_Regression.R"        "06_Linear_Regression.Rout"    
 [9] "07_Non-Linear_Reg.R"           "07_Non-Linear_Reg.Rout"       
[11] "08_Regression_Trees.R"         "08_Regression_Trees.Rout"     
[13] "10_Case_Study_Concrete.R"      "10_Case_Study_Concrete.Rout"  
[15] "11_Class_Performance.R"        "11_Class_Performance.Rout"    
[17] "12_Discriminant_Analysis.R"    "12_Discriminant_Analysis.Rout"
[19] "13_Non-Linear_Class.R"         "13_Non-Linear_Class.Rout"     
[21] "14_Class_Trees.R"              "14_Class_Trees.Rout"          
[23] "16_Class_Imbalance.R"          "16_Class_Imbalance.Rout"      
[25] "17_Job_Scheduling.R"           "17_Job_Scheduling.Rout"       
[27] "18_Importance.R"               "18_Importance.Rout"           
[29] "19_Feature_Select.R"           "19_Feature_Select.Rout"       
[31] "CreateGrantData.R"             "CreateGrantData.Rout"         

> ################################################################################
> ### R code from Applied Predictive Modeling (2013) by Kuhn and Jo .... [TRUNCATED] 

> data(FuelEconomy)

> ## Format data for plotting against engine displacement
> 
> ## Sort by engine displacement
> cars2010 <- cars2010[order(cars2010$EngDispl),]

> cars2011 <- cars2011[order(cars2011$EngDispl),]

> ## Combine data into one data frame
> cars2010a <- cars2010

> cars2010a$Year <- "2010 Model Year"

> cars2011a <- cars2011

> cars2011a$Year <- "2011 Model Year"

> plotData <- rbind(cars2010a, cars2011a)

> library(lattice)

> xyplot(FE ~ EngDispl|Year, plotData,
+        xlab = "Engine Displacement",
+        ylab = "Fuel Efficiency (MPG)",
+        between = list(x = 1.2 .... [TRUNCATED] 

> ## Fit a single linear model and conduct 10-fold CV to estimate the error
> library(caret)

> set.seed(1)

> lm1Fit <- train(FE ~ EngDispl, 
+                 data = cars2010,
+                 method = "lm", 
+                 trControl = trainControl(meth .... [TRUNCATED] 

> lm1Fit
Linear Regression 

1107 samples
  13 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 

Summary of sample sizes: 997, 996, 995, 996, 997, 996, ... 

Resampling results

  RMSE      Rsquared  RMSE SD   Rsquared SD
  4.604285  0.628494  0.492878  0.04418925 

 

> ## Fit a quadratic model too
> 
> ## Create squared terms
> cars2010$ED2 <- cars2010$EngDispl^2

> cars2011$ED2 <- cars2011$EngDispl^2

> set.seed(1)

> lm2Fit <- train(FE ~ EngDispl + ED2, 
+                 data = cars2010,
+                 method = "lm", 
+                 trControl = trainContro .... [TRUNCATED] 

> lm2Fit
Linear Regression 

1107 samples
  14 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 

Summary of sample sizes: 997, 996, 995, 996, 997, 996, ... 

Resampling results

  RMSE      Rsquared   RMSE SD    Rsquared SD
  4.228432  0.6843226  0.4194454  0.04210009 

 

> ## Finally a MARS model (via the earth package)
> 
> library(earth)
Loading required package: plotmo
Loading required package: plotrix

> set.seed(1)

> marsFit <- train(FE ~ EngDispl, 
+                  data = cars2010,
+                  method = "earth",
+                  tuneLength = 15,
+      .... [TRUNCATED] 

> marsFit
Multivariate Adaptive Regression Spline 

1107 samples
  14 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 

Summary of sample sizes: 997, 996, 995, 996, 997, 996, ... 

Resampling results across tuning parameters:

  nprune  RMSE      Rsquared   RMSE SD    Rsquared SD
  2       4.295551  0.6734579  0.4412493  0.04289014 
  3       4.255755  0.6802699  0.4403794  0.03947172 
  4       4.228066  0.6845448  0.4488977  0.04278739 
  5       4.249977  0.6820430  0.4886947  0.04318735 

Tuning parameter 'degree' was held constant at a value of 1
RMSE was used to select the optimal model using  the smallest value.
The final values used for the model were nprune = 4 and degree = 1. 

> plot(marsFit)

> ## Predict the test set data
> cars2011$lm1  <- predict(lm1Fit,  cars2011)

> cars2011$lm2  <- predict(lm2Fit,  cars2011)

> cars2011$mars <- predict(marsFit, cars2011)

> ## Get test set performance values via caret's postResample function
> 
> postResample(pred = cars2011$lm1,  obs = cars2011$FE)
     RMSE  Rsquared 
5.1625309 0.7018642 

> postResample(pred = cars2011$lm2,  obs = cars2011$FE)
     RMSE  Rsquared 
4.7162853 0.7486074 

> postResample(pred = cars2011$mars, obs = cars2011$FE)
     RMSE  Rsquared 
4.6855501 0.7499953 

> ################################################################################
> ### Session Information
> 
> sessionInfo()
R version 3.1.3 (2015-03-09)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.3 (Yosemite)

locale:
[1] C

attached base packages:
[1] parallel  tools     stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] earth_4.2.0                     plotrix_3.5-11                 
 [3] plotmo_2.2.1                    pROC_1.7.3                     
 [5] doMC_1.3.3                      iterators_1.0.7                
 [7] foreach_1.4.2                   AppliedPredictiveModeling_1.1-6
 [9] caret_6.0-41                    ggplot2_1.0.1                  
[11] lattice_0.20-31                

loaded via a namespace (and not attached):
 [1] BradleyTerry2_1.0-6 CORElearn_0.9.45    MASS_7.3-40        
 [4] Matrix_1.1-5        Rcpp_0.11.5         SparseM_1.6        
 [7] brglm_0.5-9         car_2.0-25          class_7.3-12       
[10] cluster_2.0.1       codetools_0.2-10    colorspace_1.2-6   
[13] compiler_3.1.3      digest_0.6.8        e1071_1.6-4        
[16] grid_3.1.3          gtable_0.1.2        gtools_3.4.1       
[19] lme4_1.1-7          mgcv_1.8-4          minqa_1.2.4        
[22] munsell_0.4.2       nlme_3.1-120        nloptr_1.0.4       
[25] nnet_7.3-9          pbkrtest_0.4-2      plyr_1.8.1         
[28] proto_0.3-10        quantreg_5.11       reshape2_1.4.1     
[31] rpart_4.1-9         scales_0.2.4        splines_3.1.3      
[34] stringr_0.6.2      

> ### q("no")
> 
> 
$value
[sessionInfo() output, repeated verbatim from the block above by print(source(...))]
$visible
[1] TRUE

In [137]:
%%R

## Another way to run the Chapter 2 script

library(AppliedPredictiveModeling)
data(FuelEconomy)

## Format data for plotting against engine displacement

## Sort by engine displacement
cars2010 <- cars2010[order(cars2010$EngDispl),]
cars2011 <- cars2011[order(cars2011$EngDispl),]

## Combine data into one data frame
cars2010a <- cars2010
cars2010a$Year <- "2010 Model Year"
cars2011a <- cars2011
cars2011a$Year <- "2011 Model Year"

plotData <- rbind(cars2010a, cars2011a)

library(lattice)

print(
    xyplot(FE ~ EngDispl|Year, plotData,
       xlab = "Engine Displacement",
       ylab = "Fuel Efficiency (MPG)",
       between = list(x = 1.2))
)

##########  NOTE: plotting functions in the lattice package return trellis objects;
##########  they must be explicitly print()ed to render inside scripts and %%R cells.
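A minimal illustration of this behavior, using the built-in mtcars data (not part of the APM scripts):

```r
library(lattice)

p <- xyplot(mpg ~ wt, data = mtcars)  # returns a "trellis" object; nothing is drawn yet
class(p)                              # "trellis"
print(p)                              # explicit print() renders the plot
```

At the top level of an interactive R session, auto-printing does this for you, which is why the difference only shows up inside source()d scripts, loops, functions, and notebook cells.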

## Fit a single linear model and conduct 10-fold CV to estimate the error

library(caret)
set.seed(1)
lm1Fit <- train(FE ~ EngDispl,
                data = cars2010,
                method = "lm",
                trControl = trainControl(method= "cv"))
print(lm1Fit)


## Fit a quadratic model too

## Create squared terms
cars2010$ED2 <- cars2010$EngDispl^2
cars2011$ED2 <- cars2011$EngDispl^2

set.seed(1)
lm2Fit <- train(FE ~ EngDispl + ED2,
                data = cars2010,
                method = "lm",
                trControl = trainControl(method= "cv"))
print(lm2Fit)
Linear Regression 

1107 samples
  13 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 

Summary of sample sizes: 997, 996, 995, 996, 997, 996, ... 

Resampling results

  RMSE      Rsquared  RMSE SD   Rsquared SD
  4.604285  0.628494  0.492878  0.04418925 

 
Linear Regression 

1107 samples
  14 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 

Summary of sample sizes: 997, 996, 995, 996, 997, 996, ... 

Resampling results

  RMSE      Rsquared   RMSE SD    Rsquared SD
  4.228432  0.6843226  0.4194454  0.04210009 

 
In [134]:
%%R

## Finally a MARS model (via the earth package)

library(earth)
set.seed(1)
marsFit <- train(FE ~ EngDispl,
                 data = cars2010,
                 method = "earth",
                 tuneLength = 15,
                 trControl = trainControl(method= "cv"))
print(marsFit)


plot(marsFit)
Multivariate Adaptive Regression Spline 

1107 samples
  14 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 

Summary of sample sizes: 997, 996, 995, 996, 997, 996, ... 

Resampling results across tuning parameters:

  nprune  RMSE      Rsquared   RMSE SD    Rsquared SD
  2       4.295551  0.6734579  0.4412493  0.04289014 
  3       4.255755  0.6802699  0.4403794  0.03947172 
  4       4.228066  0.6845448  0.4488977  0.04278739 
  5       4.249977  0.6820430  0.4886947  0.04318735 

Tuning parameter 'degree' was held constant at a value of 1
RMSE was used to select the optimal model using  the smallest value.
The final values used for the model were nprune = 4 and degree = 1. 
In [132]:
%%R

## Predict the test set data
cars2011$lm1  <- predict(lm1Fit,  cars2011)
cars2011$lm2  <- predict(lm2Fit,  cars2011)
cars2011$mars <- predict(marsFit, cars2011)

## Get test set performance values via caret's postResample function

print(postResample(pred = cars2011$lm1,  obs = cars2011$FE))
print(postResample(pred = cars2011$lm2,  obs = cars2011$FE))
print(postResample(pred = cars2011$mars, obs = cars2011$FE))
     RMSE  Rsquared 
5.1625309 0.7018642 
     RMSE  Rsquared 
4.7162853 0.7486074 
     RMSE  Rsquared 
4.6855501 0.7499953 
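The two statistics reported above follow the usual definitions; a hand-rolled sketch (not caret's own code) of what postResample computes for regression, where Rsquared is the squared Pearson correlation between predictions and observations:

```r
# Hand-rolled equivalents of caret::postResample's regression statistics
rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))
rsq  <- function(pred, obs) cor(pred, obs)^2

# e.g. rmse(cars2011$lm1, cars2011$FE) should reproduce the lm1 RMSE above
```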
In [45]:
%%R

showChapterScript(3)
NULL
In [46]:
%%R

showChapterOutput(3)
NULL
In [123]:
%%R -w 600 -h 600

runChapterScript(3)

##    user  system elapsed 
##   5.791   0.147   6.146
NULL
In [50]:
%%R

### Section 3.1 Case Study: Cell Segmentation in High-Content Screening

library(AppliedPredictiveModeling)
data(segmentationOriginal)

## Retain the original training set
segTrain <- subset(segmentationOriginal, Case == "Train")

## Remove the first three columns (identifier columns)
segTrainX <- segTrain[, -(1:3)]
segTrainClass <- segTrain$Class

print(colnames(segTrain))

print(table(segTrainClass))
  [1] "Cell"                          "Case"                         
  [3] "Class"                         "AngleCh1"                     
  [5] "AngleStatusCh1"                "AreaCh1"                      
  [7] "AreaStatusCh1"                 "AvgIntenCh1"                  
  [9] "AvgIntenCh2"                   "AvgIntenCh3"                  
 [11] "AvgIntenCh4"                   "AvgIntenStatusCh1"            
 [13] "AvgIntenStatusCh2"             "AvgIntenStatusCh3"            
 [15] "AvgIntenStatusCh4"             "ConvexHullAreaRatioCh1"       
 [17] "ConvexHullAreaRatioStatusCh1"  "ConvexHullPerimRatioCh1"      
 [19] "ConvexHullPerimRatioStatusCh1" "DiffIntenDensityCh1"          
 [21] "DiffIntenDensityCh3"           "DiffIntenDensityCh4"          
 [23] "DiffIntenDensityStatusCh1"     "DiffIntenDensityStatusCh3"    
 [25] "DiffIntenDensityStatusCh4"     "EntropyIntenCh1"              
 [27] "EntropyIntenCh3"               "EntropyIntenCh4"              
 [29] "EntropyIntenStatusCh1"         "EntropyIntenStatusCh3"        
 [31] "EntropyIntenStatusCh4"         "EqCircDiamCh1"                
 [33] "EqCircDiamStatusCh1"           "EqEllipseLWRCh1"              
 [35] "EqEllipseLWRStatusCh1"         "EqEllipseOblateVolCh1"        
 [37] "EqEllipseOblateVolStatusCh1"   "EqEllipseProlateVolCh1"       
 [39] "EqEllipseProlateVolStatusCh1"  "EqSphereAreaCh1"              
 [41] "EqSphereAreaStatusCh1"         "EqSphereVolCh1"               
 [43] "EqSphereVolStatusCh1"          "FiberAlign2Ch3"               
 [45] "FiberAlign2Ch4"                "FiberAlign2StatusCh3"         
 [47] "FiberAlign2StatusCh4"          "FiberLengthCh1"               
 [49] "FiberLengthStatusCh1"          "FiberWidthCh1"                
 [51] "FiberWidthStatusCh1"           "IntenCoocASMCh3"              
 [53] "IntenCoocASMCh4"               "IntenCoocASMStatusCh3"        
 [55] "IntenCoocASMStatusCh4"         "IntenCoocContrastCh3"         
 [57] "IntenCoocContrastCh4"          "IntenCoocContrastStatusCh3"   
 [59] "IntenCoocContrastStatusCh4"    "IntenCoocEntropyCh3"          
 [61] "IntenCoocEntropyCh4"           "IntenCoocEntropyStatusCh3"    
 [63] "IntenCoocEntropyStatusCh4"     "IntenCoocMaxCh3"              
 [65] "IntenCoocMaxCh4"               "IntenCoocMaxStatusCh3"        
 [67] "IntenCoocMaxStatusCh4"         "KurtIntenCh1"                 
 [69] "KurtIntenCh3"                  "KurtIntenCh4"                 
 [71] "KurtIntenStatusCh1"            "KurtIntenStatusCh3"           
 [73] "KurtIntenStatusCh4"            "LengthCh1"                    
 [75] "LengthStatusCh1"               "MemberAvgAvgIntenStatusCh2"   
 [77] "MemberAvgTotalIntenStatusCh2"  "NeighborAvgDistCh1"           
 [79] "NeighborAvgDistStatusCh1"      "NeighborMinDistCh1"           
 [81] "NeighborMinDistStatusCh1"      "NeighborVarDistCh1"           
 [83] "NeighborVarDistStatusCh1"      "PerimCh1"                     
 [85] "PerimStatusCh1"                "ShapeBFRCh1"                  
 [87] "ShapeBFRStatusCh1"             "ShapeLWRCh1"                  
 [89] "ShapeLWRStatusCh1"             "ShapeP2ACh1"                  
 [91] "ShapeP2AStatusCh1"             "SkewIntenCh1"                 
 [93] "SkewIntenCh3"                  "SkewIntenCh4"                 
 [95] "SkewIntenStatusCh1"            "SkewIntenStatusCh3"           
 [97] "SkewIntenStatusCh4"            "SpotFiberCountCh3"            
 [99] "SpotFiberCountCh4"             "SpotFiberCountStatusCh3"      
[101] "SpotFiberCountStatusCh4"       "TotalIntenCh1"                
[103] "TotalIntenCh2"                 "TotalIntenCh3"                
[105] "TotalIntenCh4"                 "TotalIntenStatusCh1"          
[107] "TotalIntenStatusCh2"           "TotalIntenStatusCh3"          
[109] "TotalIntenStatusCh4"           "VarIntenCh1"                  
[111] "VarIntenCh3"                   "VarIntenCh4"                  
[113] "VarIntenStatusCh1"             "VarIntenStatusCh3"            
[115] "VarIntenStatusCh4"             "WidthCh1"                     
[117] "WidthStatusCh1"                "XCentroid"                    
[119] "YCentroid"                    
segTrainClass
 PS  WS 
636 373 
In [58]:
%%R

### Section 3.2 Data Transformations for Individual Predictors

## The column VarIntenCh3 measures the standard deviation of the intensity
## of the pixels in the actin filaments

max(segTrainX$VarIntenCh3)/min(segTrainX$VarIntenCh3)

library(e1071)
skewness(segTrainX$VarIntenCh3)

library(caret)

## Use caret's preProcess function to transform for skewness
segPP <- preProcess(segTrainX, method = "BoxCox")

## Apply the transformations
segTrainTrans <- predict(segPP, segTrainX)

## Results for a single predictor
segPP$bc$VarIntenCh3

print(
histogram(~segTrainX$VarIntenCh3,
          xlab = "Natural Units",
          type = "count")
)
print(
histogram(~log(segTrainX$VarIntenCh3),
          xlab = "Log Units",
          ylab = " ",
          type = "count")
)
print(
segPP$bc$PerimCh1
)
print(
histogram(~segTrainX$PerimCh1,
          xlab = "Natural Units",
          type = "count")
)
print(
histogram(~segTrainTrans$PerimCh1,
          xlab = "Transformed Data",
          ylab = " ",
          type = "count")
)
Box-Cox Transformation

1009 data points used to estimate Lambda

Input data summary:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  47.74   64.37   79.02   91.61  103.20  459.80 

Largest/Smallest: 9.63 
Sample Skewness: 2.59 

Estimated Lambda: -1.1 
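What preProcess does with an estimated lambda is the standard Box-Cox formula; a sketch (my own helper, assuming strictly positive data as Box-Cox requires):

```r
# Box-Cox transform: (x^lambda - 1) / lambda, with log(x) as the lambda -> 0 limit
boxcox_apply <- function(x, lambda) {
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

# With the lambda = -1.1 estimated above for PerimCh1,
# boxcox_apply(segTrainX$PerimCh1, -1.1) should track segTrainTrans$PerimCh1
```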

In [52]:
%%R

### Section 3.3 Data Transformations for Multiple Predictors

## R's prcomp is used to conduct PCA
pr <- prcomp(~ AvgIntenCh1 + EntropyIntenCh1,
             data = segTrainTrans,
             scale. = TRUE)


transparentTheme(pchSize = .7, trans = .3)

print(
    xyplot(AvgIntenCh1 ~ EntropyIntenCh1,
       data = segTrainTrans,
       groups = segTrain$Class,
       xlab = "Intensity Entropy Channel 1",
       ylab = "Channel 1 Average Intensity",
       auto.key = list(columns = 2),
       type = c("p", "g"),
       main = "Original Data",
       aspect = 1)
)
print(
xyplot(PC2 ~ PC1,
       data = as.data.frame(pr$x),
       groups = segTrain$Class,
       xlab = "Principal Component #1",
       ylab = "Principal Component #2",
       main = "Transformed",
       xlim = extendrange(pr$x),
       ylim = extendrange(pr$x),
       type = c("p", "g"),
       aspect = 1)
)

## Apply PCA to the entire set of predictors.

## There are a few predictors with only a single value, so we remove these first
## (since PCA uses variances, which would be zero)

isZV <- apply(segTrainX, 2, function(x) length(unique(x)) == 1)
segTrainX <- segTrainX[, !isZV]

segPP <- preProcess(segTrainX, c("BoxCox", "center", "scale"))
segTrainTrans <- predict(segPP, segTrainX)

segPCA <- prcomp(segTrainTrans, center = TRUE, scale. = TRUE)

## Plot a scatterplot matrix of the first three components
transparentTheme(pchSize = .8, trans = .3)

panelRange <- extendrange(segPCA$x[, 1:3])
print(
 splom(as.data.frame(segPCA$x[, 1:3]),
      groups = segTrainClass,
      type = c("p", "g"),
      as.table = TRUE,
      auto.key = list(columns = 2),
      prepanel.limits = function(x) panelRange)
)
## Format the rotation values for plotting
segRot <- as.data.frame(segPCA$rotation[, 1:3])

## Derive the channel variable
vars <- rownames(segPCA$rotation)
channel <- rep(NA, length(vars))
channel[grepl("Ch1$", vars)] <- "Channel 1"
channel[grepl("Ch2$", vars)] <- "Channel 2"
channel[grepl("Ch3$", vars)] <- "Channel 3"
channel[grepl("Ch4$", vars)] <- "Channel 4"

segRot$Channel <- channel
segRot <- segRot[complete.cases(segRot),]

segRot$Channel <- factor(as.character(segRot$Channel))

## Plot a scatterplot matrix of the first three rotation variables

transparentTheme(pchSize = .8, trans = .7)
panelRange <- extendrange(segRot[, 1:3])
library(ellipse)
upperp <- function(...)
  {
    args <- list(...)
    circ1 <- ellipse(diag(rep(1, 2)), t = .1)
    panel.xyplot(circ1[,1], circ1[,2],
                 type = "l",
                 lty = trellis.par.get("reference.line")$lty,
                 col = trellis.par.get("reference.line")$col,
                 lwd = trellis.par.get("reference.line")$lwd)
    circ2 <- ellipse(diag(rep(1, 2)), t = .2)
    panel.xyplot(circ2[,1], circ2[,2],
                 type = "l",
                 lty = trellis.par.get("reference.line")$lty,
                 col = trellis.par.get("reference.line")$col,
                 lwd = trellis.par.get("reference.line")$lwd)
    circ3 <- ellipse(diag(rep(1, 2)), t = .3)
    panel.xyplot(circ3[,1], circ3[,2],
                 type = "l",
                 lty = trellis.par.get("reference.line")$lty,
                 col = trellis.par.get("reference.line")$col,
                 lwd = trellis.par.get("reference.line")$lwd)
    panel.xyplot(args$x, args$y, groups = args$groups, subscripts = args$subscripts)
  }
          
print(
splom(~segRot[, 1:3],
      groups = segRot$Channel,
      lower.panel = function(...){}, upper.panel = upperp,
      prepanel.limits = function(x) panelRange,
      auto.key = list(columns = 2))
)
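Before interpreting the component plots, it can help to check how much variance the leading components actually capture; a short sketch using the segPCA object fit above (prcomp stores the component standard deviations in $sdev):

```r
# Proportion of variance explained by the first few principal components
pcaVar  <- segPCA$sdev^2
propVar <- pcaVar / sum(pcaVar)
head(cbind(Proportion = propVar, Cumulative = cumsum(propVar)), 5)
```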
In [54]:
%%R

### Section 3.5 Removing Variables

## To filter on correlations, we first get the correlation matrix for the
## predictor set

segCorr <- cor(segTrainTrans)

library(corrplot)
corrplot(segCorr, order = "hclust", tl.cex = .35)

## caret's findCorrelation function is used to identify columns to remove.
highCorr <- findCorrelation(segCorr, .75)

print(highCorr)
 [1]  85  45 100  13  79   8  19  25  97  71  35  99   5   6  29  39  37   3  17
[20] 105  57  61  49  58   7  62  50  18  89  31   9 102   4  38  34  52  51 108
[39]  40  88  87  22  73
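The indices returned by findCorrelation are column positions to drop; a sketch of the filtering step (the book's script performs the equivalent):

```r
# Drop the 43 highly correlated columns identified above
segTrainFiltered <- segTrainTrans[, -highCorr]
ncol(segTrainTrans)    # before
ncol(segTrainFiltered) # after
```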
In [57]:
%%R

### Section 3.8 Computing (Creating Dummy Variables)

data(cars)
type <- c("convertible", "coupe", "hatchback", "sedan", "wagon")
cars$Type <- factor(apply(cars[, 14:18], 1, function(x) type[which(x == 1)]))

carSubset <- cars[sample(1:nrow(cars), 20), c(1, 2, 19)]

print(
    head(carSubset)
)
print(
    levels(carSubset$Type)
)
       Price Mileage        Type
759 13540.04   17343       sedan
303 18912.98   21512       sedan
765 15623.92   21272       sedan
219 33540.54   20925 convertible
550 22064.29   27384       sedan
110 11903.10   25285       coupe
[1] "convertible" "coupe"       "hatchback"   "sedan"       "wagon"      
In [56]:
%%R

simpleMod <- dummyVars(~Mileage + Type,
                       data = carSubset,
                       ## Remove the variable name from the
                       ## column name
                       levelsOnly = TRUE)
print(
    simpleMod
)

withInteraction <- dummyVars(~Mileage + Type + Mileage:Type,
                             data = carSubset,
                             levelsOnly = TRUE)
print(
    withInteraction
)
print(
    predict(withInteraction, head(carSubset))
)
Dummy Variable Object

Formula: ~Mileage + Type
2 variables, 1 factors
Factor variable names will be removed
A less than full rank encoding is used
Dummy Variable Object

Formula: ~Mileage + Type + Mileage:Type
2 variables, 1 factors
Factor variable names will be removed
A less than full rank encoding is used
    Mileage convertible coupe hatchback sedan wagon Mileage:convertible
635    9049           0     0         0     1     0                   0
421   17870           0     0         0     1     0                   0
257   26700           0     1         0     0     0                   0
221   10340           1     0         0     0     0               10340
642   25557           0     0         0     1     0                   0
84    13776           0     1         0     0     0                   0
    Mileage:coupe Mileage:hatchback Mileage:sedan Mileage:wagon
635             0                 0          9049             0
421             0                 0         17870             0
257         26700                 0             0             0
221             0                 0             0             0
642             0                 0         25557             0
84          13776                 0             0             0
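The "less than full rank encoding" noted in the output means one dummy column per factor level. For modeling functions that require a full-rank design matrix, dummyVars has a fullRank argument; a sketch not shown in the script:

```r
# Full-rank encoding: drops one reference level per factor
fullRankMod <- dummyVars(~Mileage + Type,
                         data = carSubset,
                         fullRank = TRUE)
predict(fullRankMod, head(carSubset))
```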
In [59]:
%%R

showChapterScript(4)
NULL
In [124]:
%%R

showChapterOutput(4)
NULL
In [46]:
%%R -w 600 -h 600

runChapterScript(4)
> ################################################################################
> ### R code from Applied Predictive Modeling (2013) by Kuhn and Jo .... [TRUNCATED] 

> data(GermanCredit)

> ## First, remove near-zero variance predictors then get rid of a few predictors 
> ## that duplicate values. For example, there are two possible val .... [TRUNCATED] 

> GermanCredit$CheckingAccountStatus.lt.0 <- NULL

> GermanCredit$SavingsAccountBonds.lt.100 <- NULL

> GermanCredit$EmploymentDuration.lt.1 <- NULL

> GermanCredit$EmploymentDuration.Unemployed <- NULL

> GermanCredit$Personal.Male.Married.Widowed <- NULL

> GermanCredit$Property.Unknown <- NULL

> GermanCredit$Housing.ForFree <- NULL

> ## Split the data into training (80%) and test sets (20%)
> set.seed(100)

> inTrain <- createDataPartition(GermanCredit$Class, p = .8)[[1]]

> GermanCreditTrain <- GermanCredit[ inTrain, ]

> GermanCreditTest  <- GermanCredit[-inTrain, ]

> ## The model fitting code shown in the computing section is fairly
> ## simplistic.  For the text we estimate the tuning parameter grid
> ## up-fron .... [TRUNCATED] 

> set.seed(231)

> sigDist <- sigest(Class ~ ., data = GermanCreditTrain, frac = 1)

> svmTuneGrid <- data.frame(sigma = as.vector(sigDist)[1], C = 2^(-2:7))

> ### Optional: parallel processing can be used via the 'do' packages,
> ### such as doMC, doMPI etc. We used doMC (not on Windows) to speed
> ### up  .... [TRUNCATED] 

> svmFit <- train(Class ~ .,
+                 data = GermanCreditTrain,
+                 method = "svmRadial",
+                 preProc = c("center ..." ... [TRUNCATED] 

> ## classProbs = TRUE was added since the text was written
> 
> ## Print the results
> svmFit
Support Vector Machines with Radial Basis Function Kernel 

800 samples
 41 predictor
  2 classes: 'Bad', 'Good' 

Pre-processing: centered, scaled 
Resampling: Cross-Validated (10 fold, repeated 5 times) 

Summary of sample sizes: 720, 720, 720, 720, 720, 720, ... 

Resampling results across tuning parameters:

  C       Accuracy  Kappa      Accuracy SD  Kappa SD 
    0.25  0.74125   0.3515540  0.05046025   0.1175042
    0.50  0.74050   0.3462643  0.05178941   0.1205921
    1.00  0.74475   0.3441089  0.05070234   0.1194702
    2.00  0.74175   0.3209028  0.04681229   0.1193335
    4.00  0.74275   0.3160328  0.04890967   0.1220800
    8.00  0.75325   0.3389174  0.04836682   0.1291946
   16.00  0.74700   0.3081410  0.04428859   0.1252361
   32.00  0.74200   0.2922277  0.04466142   0.1274896
   64.00  0.73975   0.2727270  0.04451338   0.1371257
  128.00  0.73650   0.2763129  0.04495179   0.1278093

Tuning parameter 'sigma' was held constant at a value of 0.008918477
Accuracy was used to select the optimal model using  the largest value.
The final values used for the model were sigma = 0.008918477 and C = 8. 

> ## A line plot of the average performance. The 'scales' argument is actually an 
> ## argument to xyplot that converts the x-axis to log-2 units.
>  .... [TRUNCATED] 

> ## Test set predictions
> 
> predictedClasses <- predict(svmFit, GermanCreditTest)

> str(predictedClasses)
 Factor w/ 2 levels "Bad","Good": 1 2 2 2 1 2 2 2 1 1 ...

> ## Use the "type" option to get class probabilities
> 
> predictedProbs <- predict(svmFit, newdata = GermanCreditTest, type = "prob")

> head(predictedProbs)
         Bad      Good
1 0.58917636 0.4108236
2 0.49818809 0.5018119
3 0.31073860 0.6892614
4 0.08949224 0.9105078
5 0.60453392 0.3954661
6 0.13487103 0.8651290

> ## Fit the same model using different resampling methods. The main syntax change
> ## is the control object.
> 
> set.seed(1056)

> svmFit10CV <- train(Class ~ .,
+                     data = GermanCreditTrain,
+                     method = "svmRadial",
+                     pre .... [TRUNCATED] 

> svmFit10CV
Support Vector Machines with Radial Basis Function Kernel 

800 samples
 41 predictor
  2 classes: 'Bad', 'Good' 

Pre-processing: centered, scaled 
Resampling: Cross-Validated (10 fold) 

Summary of sample sizes: 720, 720, 720, 720, 720, 720, ... 

Resampling results across tuning parameters:

  C       Accuracy  Kappa       Accuracy SD  Kappa SD  
    0.25  0.70000   0.00000000  0.00000000   0.00000000
    0.50  0.71875   0.09343326  0.01886539   0.07094452
    1.00  0.74375   0.27692135  0.02224391   0.07950763
    2.00  0.75875   0.36149069  0.03230175   0.07626079
    4.00  0.75500   0.36809516  0.04216370   0.11887279
    8.00  0.76125   0.39541476  0.03653860   0.10447322
   16.00  0.76625   0.41855404  0.04168749   0.11283531
   32.00  0.74875   0.38824618  0.04427267   0.10316210
   64.00  0.72875   0.34921040  0.04715886   0.10823541
  128.00  0.72875   0.35220213  0.04678927   0.10785380

Tuning parameter 'sigma' was held constant at a value of 0.008918477
Accuracy was used to select the optimal model using  the largest value.
The final values used for the model were sigma = 0.008918477 and C = 16. 

> set.seed(1056)

> svmFitLOO <- train(Class ~ .,
+                    data = GermanCreditTrain,
+                    method = "svmRadial",
+                    preProc .... [TRUNCATED] 

> svmFitLOO
Support Vector Machines with Radial Basis Function Kernel 

800 samples
 41 predictor
  2 classes: 'Bad', 'Good' 

Pre-processing: centered, scaled 
Resampling: 

Summary of sample sizes: 799, 799, 799, 799, 799, 799, ... 

Resampling results across tuning parameters:

  C       Accuracy  Kappa    
    0.25  0.70000   0.0000000
    0.50  0.71750   0.1003185
    1.00  0.74875   0.3049793
    2.00  0.74000   0.3157895
    4.00  0.74875   0.3582375
    8.00  0.76125   0.4068323
   16.00  0.76125   0.4169719
   32.00  0.72250   0.3345324
   64.00  0.71625   0.3268090
  128.00  0.72000   0.3333333

Tuning parameter 'sigma' was held constant at a value of 0.008918477
Accuracy was used to select the optimal model using  the largest value.
The final values used for the model were sigma = 0.008918477 and C = 8. 

> set.seed(1056)

> svmFitLGO <- train(Class ~ .,
+                    data = GermanCreditTrain,
+                    method = "svmRadial",
+                    preProc .... [TRUNCATED] 

> svmFitLGO 
Support Vector Machines with Radial Basis Function Kernel 

800 samples
 41 predictor
  2 classes: 'Bad', 'Good' 

Pre-processing: centered, scaled 
Resampling: Repeated Train/Test Splits Estimated (50 reps, 0.8%) 

Summary of sample sizes: 640, 640, 640, 640, 640, 640, ... 

Resampling results across tuning parameters:

  C       Accuracy  Kappa       Accuracy SD  Kappa SD  
    0.25  0.700000  0.00000000  0.000000000  0.00000000
    0.50  0.711125  0.06691009  0.009557326  0.03877930
    1.00  0.737000  0.25887472  0.022440397  0.06320724
    2.00  0.740750  0.31816867  0.023765435  0.06074014
    4.00  0.743125  0.35076031  0.028071803  0.06804724
    8.00  0.745000  0.36985984  0.025222227  0.06174940
   16.00  0.738500  0.36501972  0.030445250  0.07631435
   32.00  0.729375  0.34893389  0.029646353  0.07227117
   64.00  0.721500  0.33509585  0.029346627  0.07130233
  128.00  0.714375  0.32063672  0.030389951  0.07486036

Tuning parameter 'sigma' was held constant at a value of 0.008918477
Accuracy was used to select the optimal model using  the largest value.
The final values used for the model were sigma = 0.008918477 and C = 8. 

> set.seed(1056)

> svmFitBoot <- train(Class ~ .,
+                     data = GermanCreditTrain,
+                     method = "svmRadial",
+                     pre .... [TRUNCATED] 

> svmFitBoot
Support Vector Machines with Radial Basis Function Kernel 

800 samples
 41 predictor
  2 classes: 'Bad', 'Good' 

Pre-processing: centered, scaled 
Resampling: Bootstrapped (50 reps) 

Summary of sample sizes: 800, 800, 800, 800, 800, 800, ... 

Resampling results across tuning parameters:

  C       Accuracy   Kappa       Accuracy SD  Kappa SD  
    0.25  0.7040934  0.01896068  0.02637422   0.03273562
    0.50  0.7275975  0.18611337  0.03062648   0.08794391
    1.00  0.7388778  0.29026235  0.02445672   0.06765864
    2.00  0.7420822  0.32895315  0.01767895   0.05040255
    4.00  0.7421938  0.34486682  0.01833609   0.04747891
    8.00  0.7405316  0.35362257  0.01907557   0.05017752
   16.00  0.7349648  0.34738355  0.01916738   0.04500902
   32.00  0.7294466  0.34058430  0.02168677   0.04904437
   64.00  0.7234922  0.32974005  0.02297203   0.05086115
  128.00  0.7209653  0.32439609  0.02321969   0.05087069

Tuning parameter 'sigma' was held constant at a value of 0.008918477
Accuracy was used to select the optimal model using  the largest value.
The final values used for the model were sigma = 0.008918477 and C = 4. 

> set.seed(1056)

> svmFitBoot632 <- train(Class ~ .,
+                        data = GermanCreditTrain,
+                        method = "svmRadial",
+                .... [TRUNCATED] 

> svmFitBoot632
Support Vector Machines with Radial Basis Function Kernel 

800 samples
 41 predictor
  2 classes: 'Bad', 'Good' 

Pre-processing: centered, scaled 
Resampling: Bootstrapped (50 reps) 

Summary of sample sizes: 800, 800, 800, 800, 800, 800, ... 

Resampling results across tuning parameters:

  C       Accuracy   Kappa       Accuracy SD  Kappa SD  
    0.25  0.7025875  0.01198544  0.02637422   0.03273562
    0.50  0.7330798  0.18856955  0.03062648   0.08794391
    1.00  0.7655020  0.35980922  0.02445672   0.06765864
    2.00  0.7827026  0.43450963  0.01767895   0.05040255
    4.00  0.7979482  0.48754429  0.01833609   0.04747891
    8.00  0.8102331  0.52782744  0.01907557   0.05017752
   16.00  0.8177506  0.55166437  0.01916738   0.04500902
   32.00  0.8229996  0.56881674  0.02168677   0.04904437
   64.00  0.8219948  0.56862328  0.02297203   0.05086115
  128.00  0.8226967  0.57074712  0.02321969   0.05087069

Tuning parameter 'sigma' was held constant at a value of 0.008918477
Accuracy was used to select the optimal model using  the largest value.
The final values used for the model were sigma = 0.008918477 and C = 32. 

> ################################################################################
> ### Section 4.8 Choosing Between Models
> 
> set.seed(1056)

> glmProfile <- train(Class ~ .,
+                     data = GermanCreditTrain,
+                     method = "glm",
+                     trControl .... [TRUNCATED] 

> glmProfile
Generalized Linear Model 

800 samples
 41 predictor
  2 classes: 'Bad', 'Good' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times) 

Summary of sample sizes: 720, 720, 720, 720, 720, 720, ... 

Resampling results

  Accuracy  Kappa      Accuracy SD  Kappa SD 
  0.749     0.3647664  0.05162166   0.1218109

 

> resamp <- resamples(list(SVM = svmFit, Logistic = glmProfile))

> summary(resamp)

Call:
summary.resamples(object = resamp)

Models: SVM, Logistic 
Number of resamples: 50 

Accuracy 
           Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
SVM      0.6500   0.725 0.7625 0.7532  0.7969 0.8375    0
Logistic 0.6125   0.725 0.7562 0.7490  0.7844 0.8500    0

Kappa 
            Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
SVM      0.02778  0.2445 0.3667 0.3389  0.4444 0.5548    0
Logistic 0.07534  0.2831 0.3750 0.3648  0.4504 0.6250    0


> ## These results are slightly different from those shown in the text.
> ## There are some differences in the train() function since the 
> ## origin .... [TRUNCATED] 

> summary(modelDifferences)

Call:
summary.diff.resamples(object = modelDifferences)

p-value adjustment: bonferroni 
Upper diagonal: estimates of the difference
Lower diagonal: p-value for H0: difference = 0

Accuracy 
         SVM    Logistic
SVM             0.00425 
Logistic 0.4585         

Kappa 
         SVM     Logistic
SVM              -0.02585
Logistic 0.07948         


> ## The actual paired t-test:
> modelDifferences$statistics$Accuracy
$SVM.diff.Logistic

	One Sample t-test

data:  x
t = 0.7472, df = 49, p-value = 0.4585
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 -0.007179558  0.015679558
sample estimates:
mean of x 
  0.00425 



> ################################################################################
> ### Session Information
> 
> sessionInfo()
R version 3.1.3 (2015-03-09)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.3 (Yosemite)

locale:
[1] C

attached base packages:
[1] parallel  tools     stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] kernlab_0.9-20                  corrplot_0.73                  
 [3] ellipse_0.3-8                   e1071_1.6-4                    
 [5] earth_4.2.0                     plotrix_3.5-11                 
 [7] plotmo_2.2.1                    doMC_1.3.3                     
 [9] iterators_1.0.7                 foreach_1.4.2                  
[11] AppliedPredictiveModeling_1.1-6 caret_6.0-41                   
[13] ggplot2_1.0.1                   lattice_0.20-31                

loaded via a namespace (and not attached):
 [1] BradleyTerry2_1.0-6 CORElearn_0.9.45    MASS_7.3-40        
 [4] Matrix_1.1-5        Rcpp_0.11.5         SparseM_1.6        
 [7] brglm_0.5-9         car_2.0-25          class_7.3-12       
[10] cluster_2.0.1       codetools_0.2-10    colorspace_1.2-6   
[13] compiler_3.1.3      digest_0.6.8        grid_3.1.3         
[16] gtable_0.1.2        gtools_3.4.1        lme4_1.1-7         
[19] mgcv_1.8-4          minqa_1.2.4         munsell_0.4.2      
[22] nlme_3.1-120        nloptr_1.0.4        nnet_7.3-9         
[25] pbkrtest_0.4-2      plyr_1.8.1          proto_0.3-10       
[28] quantreg_5.11       reshape2_1.4.1      rpart_4.1-9        
[31] scales_0.2.4        splines_3.1.3       stringr_0.6.2      

> ### q("no")
> 
> 
> 
In [52]:
%%R

minutes_required_for_previous_script = 3260.432 / 60
print(minutes_required_for_previous_script)

## user   system  elapsed 
## 3260.432  211.968  906.933
[1] 54.34053
In [125]:
%%R

######## This computation can take about five minutes to complete on a single CPU.

### Section 4.6 Choosing Final Tuning Parameters

detach(package:caret)  # reload the package, since the code here modifies GermanCredit
library(caret)
data(GermanCredit)

## First, remove near-zero variance predictors, then get rid of a few
## predictors that duplicate values. For example, there are three possible
## values for the housing variable: "Rent", "Own", and "ForFree". To avoid
## linear dependencies, we get rid of one of the levels (e.g. "ForFree").

GermanCredit <- GermanCredit[, -nearZeroVar(GermanCredit)]
GermanCredit$CheckingAccountStatus.lt.0 <- NULL
GermanCredit$SavingsAccountBonds.lt.100 <- NULL
GermanCredit$EmploymentDuration.lt.1 <- NULL
GermanCredit$EmploymentDuration.Unemployed <- NULL
GermanCredit$Personal.Male.Married.Widowed <- NULL
GermanCredit$Property.Unknown <- NULL
GermanCredit$Housing.ForFree <- NULL

## Split the data into training (80%) and test sets (20%)
set.seed(100)
inTrain <- createDataPartition(GermanCredit$Class, p = .8)[[1]]
GermanCreditTrain <- GermanCredit[ inTrain, ]
GermanCreditTest  <- GermanCredit[-inTrain, ]

## The model fitting code shown in the computing section is fairly
## simplistic. For the text, we estimate the tuning parameter grid
## up-front and pass it in explicitly. This generally is not needed,
## but it is done here so that we can trim the cost values to a
## presentable range and re-use the grid later with different
## resampling methods.

library(kernlab)
set.seed(231)
sigDist <- sigest(Class ~ ., data = GermanCreditTrain, frac = 1)
svmTuneGrid <- data.frame(sigma = as.vector(sigDist)[1], C = 2^(-2:7))

### Optional: parallel processing can be used via the 'do' packages,
### such as doMC, doMPI, etc. We used doMC (not available on Windows)
### to speed up the computations.

### WARNING: Be aware of how much memory is needed to parallel
### process. It can very quickly overwhelm the available hardware. We
### estimate the memory usage (VSIZE = total memory size) to be
### 2566M/core.

### library(doMC)
### registerDoMC(4)

set.seed(1056)
svmFit <- train(Class ~ .,
                data = GermanCreditTrain,
                method = "svmRadial",
                preProc = c("center", "scale"),
                tuneGrid = svmTuneGrid,
                trControl = trainControl(method = "repeatedcv",
                                         repeats = 5,
                                         classProbs = TRUE))
## classProbs = TRUE was added since the text was written

## Print the results
print(
    svmFit
)

## A line plot of the average performance. The 'scales' argument is actually an
## argument to xyplot that converts the x-axis to log-2 units.

print(
plot(svmFit, scales = list(x = list(log = 2)))
)

## Test set predictions

predictedClasses <- predict(svmFit, GermanCreditTest)
print(
    str(predictedClasses)
)

## Use the "type" option to get class probabilities

predictedProbs <- predict(svmFit, newdata = GermanCreditTest, type = "prob")
print(
    head(predictedProbs)
)
Attaching package: 'caret'

The following object is masked from 'package:pls':

    R2

Support Vector Machines with Radial Basis Function Kernel 

800 samples
 41 predictor
  2 classes: 'Bad', 'Good' 

Pre-processing: centered, scaled 
Resampling: Cross-Validated (10 fold, repeated 5 times) 

Summary of sample sizes: 720, 720, 720, 720, 720, 720, ... 

Resampling results across tuning parameters:

  C       Accuracy  Kappa      Accuracy SD  Kappa SD 
    0.25  0.74125   0.3515540  0.05046025   0.1175042
    0.50  0.74050   0.3462643  0.05178941   0.1205921
    1.00  0.74475   0.3441089  0.05070234   0.1194702
    2.00  0.74175   0.3209028  0.04681229   0.1193335
    4.00  0.74275   0.3160328  0.04890967   0.1220800
    8.00  0.75325   0.3389174  0.04836682   0.1291946
   16.00  0.74700   0.3081410  0.04428859   0.1252361
   32.00  0.74200   0.2922277  0.04466142   0.1274896
   64.00  0.73975   0.2727270  0.04451338   0.1371257
  128.00  0.73650   0.2763129  0.04495179   0.1278093

Tuning parameter 'sigma' was held constant at a value of 0.008918477
Accuracy was used to select the optimal model using  the largest value.
The final values used for the model were sigma = 0.008918477 and C = 8. 
 Factor w/ 2 levels "Bad","Good": 1 2 2 2 1 2 2 2 1 1 ...
NULL
         Bad      Good
1 0.58917636 0.4108236
2 0.49818809 0.5018119
3 0.31073860 0.6892614
4 0.08949224 0.9105078
5 0.60453392 0.3954661
6 0.13487103 0.8651290
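The cost column in the tables above is exactly the grid `C = 2^(-2:7)` built into `svmTuneGrid`. As a quick illustration (Python here, since the notebook itself runs under IPython), the ten values double from 0.25 up to 128:

```python
# Reproduce the cost grid C = 2^(-2:7) from svmTuneGrid.
# R's -2:7 is the inclusive integer sequence -2, -1, ..., 7.
cost_grid = [2.0 ** k for k in range(-2, 8)]
print(cost_grid)
# [0.25, 0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0]
```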
In [122]:
%%R

######## This computation can take over half an hour to complete on a single CPU.

## Fit the same model using different resampling methods. The main syntax change
## is the control object.

set.seed(1056)
svmFit10CV <- train(Class ~ .,
                    data = GermanCreditTrain,
                    method = "svmRadial",
                    preProc = c("center", "scale"),
                    tuneGrid = svmTuneGrid,
                    trControl = trainControl(method = "cv", number = 10))
print(
    svmFit10CV
)

set.seed(1056)
svmFitLOO <- train(Class ~ .,
                   data = GermanCreditTrain,
                   method = "svmRadial",
                   preProc = c("center", "scale"),
                   tuneGrid = svmTuneGrid,
                   trControl = trainControl(method = "LOOCV"))
print(
    svmFitLOO
)

set.seed(1056)
svmFitLGO <- train(Class ~ .,
                   data = GermanCreditTrain,
                   method = "svmRadial",
                   preProc = c("center", "scale"),
                   tuneGrid = svmTuneGrid,
                   trControl = trainControl(method = "LGOCV",
                                            number = 50,
                                            p = .8))
print(
    svmFitLGO
)

set.seed(1056)
svmFitBoot <- train(Class ~ .,
                    data = GermanCreditTrain,
                    method = "svmRadial",
                    preProc = c("center", "scale"),
                    tuneGrid = svmTuneGrid,
                    trControl = trainControl(method = "boot", number = 50))
print(
    svmFitBoot
)

set.seed(1056)
svmFitBoot632 <- train(Class ~ .,
                       data = GermanCreditTrain,
                       method = "svmRadial",
                       preProc = c("center", "scale"),
                       tuneGrid = svmTuneGrid,
                       trControl = trainControl(method = "boot632",
                                                number = 50))
print(
    svmFitBoot632
)
Support Vector Machines with Radial Basis Function Kernel 

800 samples
 41 predictor
  2 classes: 'Bad', 'Good' 

Pre-processing: centered, scaled 
Resampling: Cross-Validated (10 fold) 

Summary of sample sizes: 720, 720, 720, 720, 720, 720, ... 

Resampling results across tuning parameters:

  C       Accuracy  Kappa       Accuracy SD  Kappa SD  
    0.25  0.70000   0.00000000  0.00000000   0.00000000
    0.50  0.71875   0.09343326  0.01886539   0.07094452
    1.00  0.74375   0.27692135  0.02224391   0.07950763
    2.00  0.75875   0.36149069  0.03230175   0.07626079
    4.00  0.75500   0.36809516  0.04216370   0.11887279
    8.00  0.76125   0.39541476  0.03653860   0.10447322
   16.00  0.76625   0.41855404  0.04168749   0.11283531
   32.00  0.74875   0.38824618  0.04427267   0.10316210
   64.00  0.72875   0.34921040  0.04715886   0.10823541
  128.00  0.72875   0.35220213  0.04678927   0.10785380

Tuning parameter 'sigma' was held constant at a value of 0.008918477
Accuracy was used to select the optimal model using  the largest value.
The final values used for the model were sigma = 0.008918477 and C = 16. 
Support Vector Machines with Radial Basis Function Kernel 

800 samples
 41 predictor
  2 classes: 'Bad', 'Good' 

Pre-processing: centered, scaled 
Resampling: 

Summary of sample sizes: 799, 799, 799, 799, 799, 799, ... 

Resampling results across tuning parameters:

  C       Accuracy  Kappa    
    0.25  0.70000   0.0000000
    0.50  0.71750   0.1003185
    1.00  0.74875   0.3049793
    2.00  0.74000   0.3157895
    4.00  0.74875   0.3582375
    8.00  0.76125   0.4068323
   16.00  0.76125   0.4169719
   32.00  0.72250   0.3345324
   64.00  0.71625   0.3268090
  128.00  0.72000   0.3333333

Tuning parameter 'sigma' was held constant at a value of 0.008918477
Accuracy was used to select the optimal model using  the largest value.
The final values used for the model were sigma = 0.008918477 and C = 8. 
Support Vector Machines with Radial Basis Function Kernel 

800 samples
 41 predictor
  2 classes: 'Bad', 'Good' 

Pre-processing: centered, scaled 
Resampling: Repeated Train/Test Splits Estimated (50 reps, 0.8%) 

Summary of sample sizes: 640, 640, 640, 640, 640, 640, ... 

Resampling results across tuning parameters:

  C       Accuracy  Kappa       Accuracy SD  Kappa SD  
    0.25  0.700000  0.00000000  0.000000000  0.00000000
    0.50  0.711125  0.06691009  0.009557326  0.03877930
    1.00  0.737000  0.25887472  0.022440397  0.06320724
    2.00  0.740750  0.31816867  0.023765435  0.06074014
    4.00  0.743125  0.35076031  0.028071803  0.06804724
    8.00  0.745000  0.36985984  0.025222227  0.06174940
   16.00  0.738500  0.36501972  0.030445250  0.07631435
   32.00  0.729375  0.34893389  0.029646353  0.07227117
   64.00  0.721500  0.33509585  0.029346627  0.07130233
  128.00  0.714375  0.32063672  0.030389951  0.07486036

Tuning parameter 'sigma' was held constant at a value of 0.008918477
Accuracy was used to select the optimal model using  the largest value.
The final values used for the model were sigma = 0.008918477 and C = 8. 
Support Vector Machines with Radial Basis Function Kernel 

800 samples
 41 predictor
  2 classes: 'Bad', 'Good' 

Pre-processing: centered, scaled 
Resampling: Bootstrapped (50 reps) 

Summary of sample sizes: 800, 800, 800, 800, 800, 800, ... 

Resampling results across tuning parameters:

  C       Accuracy   Kappa       Accuracy SD  Kappa SD  
    0.25  0.7040934  0.01896068  0.02637422   0.03273562
    0.50  0.7275975  0.18611337  0.03062648   0.08794391
    1.00  0.7388778  0.29026235  0.02445672   0.06765864
    2.00  0.7420822  0.32895315  0.01767895   0.05040255
    4.00  0.7421938  0.34486682  0.01833609   0.04747891
    8.00  0.7405316  0.35362257  0.01907557   0.05017752
   16.00  0.7349648  0.34738355  0.01916738   0.04500902
   32.00  0.7294466  0.34058430  0.02168677   0.04904437
   64.00  0.7234922  0.32974005  0.02297203   0.05086115
  128.00  0.7209653  0.32439609  0.02321969   0.05087069

Tuning parameter 'sigma' was held constant at a value of 0.008918477
Accuracy was used to select the optimal model using  the largest value.
The final values used for the model were sigma = 0.008918477 and C = 4. 
Support Vector Machines with Radial Basis Function Kernel 

800 samples
 41 predictor
  2 classes: 'Bad', 'Good' 

Pre-processing: centered, scaled 
Resampling: Bootstrapped (50 reps) 

Summary of sample sizes: 800, 800, 800, 800, 800, 800, ... 

Resampling results across tuning parameters:

  C       Accuracy   Kappa       Accuracy SD  Kappa SD  
    0.25  0.7025875  0.01198544  0.02637422   0.03273562
    0.50  0.7330798  0.18856955  0.03062648   0.08794391
    1.00  0.7655020  0.35980922  0.02445672   0.06765864
    2.00  0.7827026  0.43450963  0.01767895   0.05040255
    4.00  0.7979482  0.48754429  0.01833609   0.04747891
    8.00  0.8102331  0.52782744  0.01907557   0.05017752
   16.00  0.8177506  0.55166437  0.01916738   0.04500902
   32.00  0.8229996  0.56881674  0.02168677   0.04904437
   64.00  0.8219948  0.56862328  0.02297203   0.05086115
  128.00  0.8226967  0.57074712  0.02321969   0.05087069

Tuning parameter 'sigma' was held constant at a value of 0.008918477
Accuracy was used to select the optimal model using  the largest value.
The final values used for the model were sigma = 0.008918477 and C = 32. 
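The noticeably higher accuracies from `boot632` are expected: caret's "boot632" method applies Efron's .632 rule, blending the ordinary bootstrap estimate with the (optimistic) apparent accuracy measured on the training set. A minimal sketch of the blend in Python, using hypothetical accuracies rather than values from the output above:

```python
# Efron's .632 estimator: 0.632 * bootstrap estimate + 0.368 * apparent
# (resubstitution) estimate. The apparent estimate is optimistic and the
# plain bootstrap estimate pessimistic; the blend sits in between.
def boot632(boot_estimate, apparent_estimate):
    return 0.632 * boot_estimate + 0.368 * apparent_estimate

# Hypothetical: plain-bootstrap accuracy 0.74, apparent accuracy 0.95.
print(round(boot632(0.74, 0.95), 5))  # 0.81728
```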
In [61]:
%%R

### Section 4.8 Choosing Between Models

set.seed(1056)
glmProfile <- train(Class ~ .,
                    data = GermanCreditTrain,
                    method = "glm",
                    trControl = trainControl(method = "repeatedcv",
                                             repeats = 5))
print(
    glmProfile
)

resamp <- resamples(list(SVM = svmFit, Logistic = glmProfile))
print(
    summary(resamp)
)

## These results are slightly different from those shown in the text.
## There are some differences in the train() function since the
## original results were produced. This is due to a difference in
## predictions from the ksvm() function when class probs are requested
## and when they are not. See, for example,
## https://stat.ethz.ch/pipermail/r-help/2013-November/363188.html

modelDifferences <- diff(resamp)
print(
    summary(modelDifferences)
)

## The actual paired t-test:
print(
    modelDifferences$statistics$Accuracy
)
Generalized Linear Model 

800 samples
 41 predictor
  2 classes: 'Bad', 'Good' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times) 

Summary of sample sizes: 720, 720, 720, 720, 720, 720, ... 

Resampling results

  Accuracy  Kappa      Accuracy SD  Kappa SD 
  0.749     0.3647664  0.05162166   0.1218109

 

Call:
summary.resamples(object = resamp)

Models: SVM, Logistic 
Number of resamples: 50 

Accuracy 
           Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
SVM      0.6500   0.725 0.7625 0.7532  0.7969 0.8375    0
Logistic 0.6125   0.725 0.7562 0.7490  0.7844 0.8500    0

Kappa 
            Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
SVM      0.02778  0.2445 0.3667 0.3389  0.4444 0.5548    0
Logistic 0.07534  0.2831 0.3750 0.3648  0.4504 0.6250    0


Call:
summary.diff.resamples(object = modelDifferences)

p-value adjustment: bonferroni 
Upper diagonal: estimates of the difference
Lower diagonal: p-value for H0: difference = 0

Accuracy 
         SVM    Logistic
SVM             0.00425 
Logistic 0.4585         

Kappa 
         SVM     Logistic
SVM              -0.02585
Logistic 0.07948         

$SVM.diff.Logistic

	One Sample t-test

data:  x
t = 0.7472, df = 49, p-value = 0.4585
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 -0.007179558  0.015679558
sample estimates:
mean of x 
  0.00425 
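As the output shows, `diff(resamp)` boils down to a one-sample t-test on the 50 per-resample accuracy differences (SVM minus logistic regression). A minimal Python sketch of that computation, with made-up difference values rather than the actual resamples:

```python
import math
from statistics import mean, stdev

def one_sample_t(diffs):
    """t statistic for H0: mean(diffs) = 0 (a paired model comparison)."""
    n = len(diffs)
    se = stdev(diffs) / math.sqrt(n)  # standard error of the mean
    return mean(diffs) / se

# Hypothetical per-resample accuracy differences (SVM - Logistic):
diffs = [0.0125, -0.0250, 0.0125, 0.0000, 0.0375, -0.0125, 0.0250, 0.0125]
print(round(one_sample_t(diffs), 3))  # 1.106
```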


In [62]:
%%R

showChapterScript(6)
NULL
In [63]:
%%R

showChapterOutput(6)
NULL
In [56]:
%%R -w 600 -h 600

runChapterScript(6)

##     user  system elapsed 
##  540.993  74.917 615.942
> ################################################################################
> ### R code from Applied Predictive Modeling (2013) by Kuhn and Jo .... [TRUNCATED] 

> data(solubility)

> library(lattice)

> ### Some initial plots of the data
> 
> xyplot(solTrainY ~ solTrainX$MolWeight, type = c("p", "g"),
+        ylab = "Solubility (log)",
+        mai .... [TRUNCATED] 

> xyplot(solTrainY ~ solTrainX$NumRotBonds, type = c("p", "g"),
+        ylab = "Solubility (log)",
+        xlab = "Number of Rotatable Bonds")

> bwplot(solTrainY ~ ifelse(solTrainX[,100] == 1, 
+                           "structure present", 
+                           "structure absent"),
 .... [TRUNCATED] 

> ### Find the columns that are not fingerprints (i.e. the continuous
> ### predictors). grep will return a list of integers corresponding to
> ### co .... [TRUNCATED] 

> library(caret)

> featurePlot(solTrainXtrans[, -notFingerprints],
+             solTrainY,
+             between = list(x = 1, y = 1),
+             type = c("g", "p" .... [TRUNCATED] 

> library(corrplot)

> ### We used the full namespace to call this function because the pls
> ### package (also used in this chapter) has a function with the same
> ### na .... [TRUNCATED] 

> ################################################################################
> ### Section 6.2 Linear Regression
> 
> ### Create a control funct .... [TRUNCATED] 

> indx <- createFolds(solTrainY, returnTrain = TRUE)

> ctrl <- trainControl(method = "cv", index = indx)

> ### Linear regression model with all of the predictors. This will
> ### produce some warnings that a 'rank-deficient fit may be
> ### misleading'. T .... [TRUNCATED] 

> lmTune0 <- train(x = solTrainXtrans, y = solTrainY,
+                  method = "lm",
+                  trControl = ctrl)

> lmTune0                 
Linear Regression 

951 samples
228 predictors

No pre-processing
Resampling: Cross-Validated (10 fold) 

Summary of sample sizes: 856, 857, 855, 856, 856, 855, ... 

Resampling results

  RMSE       Rsquared   RMSE SD     Rsquared SD
  0.7210355  0.8768359  0.06998223  0.02467069 

 

> ### And another using a set of predictors reduced by unsupervised
> ### filtering. We apply a filter to reduce extreme between-predictor
> ### corre .... [TRUNCATED] 

> trainXfiltered <- solTrainXtrans[, -tooHigh]

> testXfiltered  <-  solTestXtrans[, -tooHigh]

> set.seed(100)

> lmTune <- train(x = trainXfiltered, y = solTrainY,
+                 method = "lm",
+                 trControl = ctrl)

> lmTune
Linear Regression 

951 samples
190 predictors

No pre-processing
Resampling: Cross-Validated (10 fold) 

Summary of sample sizes: 856, 857, 855, 856, 856, 855, ... 

Resampling results

  RMSE       Rsquared   RMSE SD     Rsquared SD
  0.7113935  0.8793396  0.06320545  0.02434305 

 

> ### Save the test set results in a data frame                 
> testResults <- data.frame(obs = solTestY,
+                           Linear_Regres .... [TRUNCATED] 

> ################################################################################
> ### Section 6.3 Partial Least Squares
> 
> ## Run PLS and PCR on  .... [TRUNCATED] 

> plsTune <- train(x = solTrainXtrans, y = solTrainY,
+                  method = "pls",
+                  tuneGrid = expand.grid(ncomp = 1:20),
+    .... [TRUNCATED] 
Loading required package: pls

Attaching package: 'pls'

The following object is masked from 'package:corrplot':

    corrplot

The following object is masked from 'package:caret':

    R2

The following object is masked from 'package:stats':

    loadings


> plsTune
Partial Least Squares 

951 samples
228 predictors

No pre-processing
Resampling: Cross-Validated (10 fold) 

Summary of sample sizes: 856, 857, 855, 856, 856, 855, ... 

Resampling results across tuning parameters:

  ncomp  RMSE       Rsquared   RMSE SD     Rsquared SD
   1     1.7543811  0.2630495  0.08396462  0.06500848 
   2     1.2720647  0.6128490  0.07938883  0.05345622 
   3     1.0373646  0.7432147  0.07155432  0.02761174 
   4     0.8370618  0.8317217  0.05615036  0.02574808 
   5     0.7458318  0.8660461  0.03778846  0.01932122 
   6     0.7106591  0.8779019  0.03432693  0.02281696 
   7     0.6921293  0.8841448  0.03794937  0.02403533 
   8     0.6908481  0.8851647  0.03282238  0.01967729 
   9     0.6828771  0.8877056  0.02910576  0.01851863 
  10     0.6824521  0.8879195  0.03050242  0.01870212 
  11     0.6826719  0.8878955  0.02914169  0.01953986 
  12     0.6847473  0.8872488  0.03726823  0.01936983 
  13     0.6836698  0.8875568  0.03972887  0.01935437 
  14     0.6856134  0.8871389  0.03984337  0.01855409 
  15     0.6867190  0.8869351  0.04224044  0.01944079 
  16     0.6860797  0.8872705  0.04359318  0.02079411 
  17     0.6881636  0.8866078  0.04626247  0.02130103 
  18     0.6926077  0.8853743  0.04810637  0.02213141 
  19     0.6943936  0.8848611  0.04858541  0.02206531 
  20     0.6977396  0.8837453  0.05295825  0.02247232 

RMSE was used to select the optimal model using  the smallest value.
The final value used for the model was ncomp = 10. 

> testResults$PLS <- predict(plsTune, solTestXtrans)

> set.seed(100)

> pcrTune <- train(x = solTrainXtrans, y = solTrainY,
+                  method = "pcr",
+                  tuneGrid = expand.grid(ncomp = 1:35),
+    .... [TRUNCATED] 

> pcrTune                  
Principal Component Analysis 

951 samples
228 predictors

No pre-processing
Resampling: Cross-Validated (10 fold) 

Summary of sample sizes: 856, 857, 855, 856, 856, 855, ... 

Resampling results across tuning parameters:

  ncomp  RMSE       Rsquared    RMSE SD     Rsquared SD
   1     1.9778920  0.06590758  0.11043847  0.03465612 
   2     1.6379400  0.36202127  0.09825075  0.08480717 
   3     1.3655645  0.55546442  0.09395858  0.04528156 
   4     1.3715028  0.55157507  0.09810878  0.04757889 
   5     1.3415864  0.57099834  0.10467614  0.06166222 
   6     1.2081745  0.64973828  0.08788513  0.06148380 
   7     1.1818622  0.66578017  0.10108519  0.06050609 
   8     1.1452119  0.68759737  0.07782801  0.04078188 
   9     1.0495852  0.73655117  0.08201882  0.03697880 
  10     1.0063822  0.75723962  0.09589129  0.04169283 
  11     0.9723334  0.77443568  0.07775156  0.02843482 
  12     0.9692845  0.77566291  0.07887512  0.02905775 
  13     0.9526792  0.78316647  0.07637597  0.02724077 
  14     0.9396590  0.78895459  0.07056722  0.02444445 
  15     0.9419390  0.78796957  0.06837934  0.02414867 
  16     0.8695211  0.81842614  0.04668856  0.02511778 
  17     0.8699482  0.81825536  0.04575858  0.02485892 
  18     0.8719274  0.81723654  0.04753794  0.02576886 
  19     0.8695726  0.81824845  0.04727016  0.02659831 
  20     0.8682556  0.81894961  0.04730875  0.02681389 
  21     0.8096228  0.84189134  0.04576547  0.02447005 
  22     0.8122517  0.84082141  0.04477924  0.02426518 
  23     0.8093641  0.84200427  0.04457044  0.02513324 
  24     0.8096163  0.84210474  0.04011203  0.02327652 
  25     0.8095766  0.84208293  0.03900307  0.02355872 
  26     0.8049366  0.84421798  0.03676154  0.02129394 
  27     0.8039803  0.84465744  0.03378393  0.02036649 
  28     0.8056953  0.84397657  0.03395966  0.02100737 
  29     0.7863312  0.85146390  0.03603401  0.01889728 
  30     0.7819408  0.85271068  0.03068473  0.02057117 
  31     0.7795830  0.85355495  0.02832846  0.02096832 
  32     0.7757032  0.85503975  0.03571378  0.02166955 
  33     0.7395733  0.86853408  0.03063334  0.01813624 
  34     0.7327021  0.87065692  0.03102043  0.02117680 
  35     0.7307134  0.87142813  0.03570471  0.02195190 

RMSE was used to select the optimal model using  the smallest value.
The final value used for the model was ncomp = 35. 

> plsResamples <- plsTune$results

> plsResamples$Model <- "PLS"

> pcrResamples <- pcrTune$results

> pcrResamples$Model <- "PCR"

> plsPlotData <- rbind(plsResamples, pcrResamples)

> xyplot(RMSE ~ ncomp,
+        data = plsPlotData,
+        #aspect = 1,
+        xlab = "# Components",
+        ylab = "RMSE (Cross-Validation)",
+ .... [TRUNCATED] 

> plsImp <- varImp(plsTune, scale = FALSE)

> plot(plsImp, top = 25, scales = list(y = list(cex = .95)))

> ################################################################################
> ### Section 6.4 Penalized Models
> 
> ## The text used the elasti .... [TRUNCATED] 

> set.seed(100)

> ridgeTune <- train(x = solTrainXtrans, y = solTrainY,
+                    method = "ridge",
+                    tuneGrid = ridgeGrid,
+            .... [TRUNCATED] 
Loading required package: elasticnet
Loading required package: lars
Loaded lars 1.2


> ridgeTune
Ridge Regression 

951 samples
228 predictors

Pre-processing: centered, scaled 
Resampling: Cross-Validated (10 fold) 

Summary of sample sizes: 856, 857, 855, 856, 856, 855, ... 

Resampling results across tuning parameters:

  lambda       RMSE       Rsquared   RMSE SD     Rsquared SD
  0.000000000  0.7207117  0.8769717  0.06994063  0.02450628 
  0.007142857  0.7047552  0.8818659  0.04495581  0.01988253 
  0.014285714  0.6964731  0.8847911  0.04051497  0.01867276 
  0.021428571  0.6925923  0.8862699  0.03781419  0.01797165 
  0.028571429  0.6908607  0.8870609  0.03593594  0.01748178 
  0.035714286  0.6904220  0.8874561  0.03457159  0.01710886 
  0.042857143  0.6908548  0.8875998  0.03357310  0.01681167 
  0.050000000  0.6919207  0.8875741  0.03285297  0.01656815 
  0.057142857  0.6934783  0.8874278  0.03234969  0.01636278 
  0.064285714  0.6954114  0.8872009  0.03202921  0.01619286 
  0.071428571  0.6976723  0.8869096  0.03185067  0.01604581 
  0.078571429  0.7002069  0.8865723  0.03179153  0.01591906 
  0.085714286  0.7029801  0.8862009  0.03183151  0.01580906 
  0.092857143  0.7059656  0.8858041  0.03195417  0.01571305 
  0.100000000  0.7091432  0.8853885  0.03214610  0.01562886 

RMSE was used to select the optimal model using  the smallest value.
The final value used for the model was lambda = 0.03571429. 

> print(update(plot(ridgeTune), xlab = "Penalty"))

> enetGrid <- expand.grid(lambda = c(0, 0.01, .1), 
+                         fraction = seq(.05, 1, length = 20))

> set.seed(100)

> enetTune <- train(x = solTrainXtrans, y = solTrainY,
+                   method = "enet",
+                   tuneGrid = enetGrid,
+                 .... [TRUNCATED] 

> enetTune
Elasticnet 

951 samples
228 predictors

Pre-processing: centered, scaled 
Resampling: Cross-Validated (10 fold) 

Summary of sample sizes: 856, 857, 855, 856, 856, 855, ... 

Resampling results across tuning parameters:

  lambda  fraction  RMSE       Rsquared   RMSE SD     Rsquared SD
  0.00    0.05      0.8713747  0.8337289  0.03816148  0.02737681 
  0.00    0.10      0.6882637  0.8858786  0.04298815  0.02064030 
  0.00    0.15      0.6729264  0.8907993  0.03942228  0.01837582 
  0.00    0.20      0.6754697  0.8903865  0.03807506  0.01760700 
  0.00    0.25      0.6879252  0.8865202  0.04383623  0.01946378 
  0.00    0.30      0.6971062  0.8836414  0.04812788  0.02058289 
  0.00    0.35      0.7062274  0.8808469  0.05191262  0.02155822 
  0.00    0.40      0.7125900  0.8788942  0.05345207  0.02192952 
  0.00    0.45      0.7138742  0.8785588  0.05342746  0.02178996 
  0.00    0.50      0.7141235  0.8785622  0.05461747  0.02183522 
  0.00    0.55      0.7144669  0.8784961  0.05583323  0.02211744 
  0.00    0.60      0.7140532  0.8786593  0.05739702  0.02234513 
  0.00    0.65      0.7140599  0.8786880  0.05941448  0.02265512 
  0.00    0.70      0.7145464  0.8785744  0.06116481  0.02298579 
  0.00    0.75      0.7151011  0.8784348  0.06289926  0.02335653 
  0.00    0.80      0.7158067  0.8782629  0.06453350  0.02366829 
  0.00    0.85      0.7167918  0.8780158  0.06564865  0.02383283 
  0.00    0.90      0.7178711  0.8777467  0.06672370  0.02398923 
  0.00    0.95      0.7191448  0.8774055  0.06834509  0.02424302 
  0.00    1.00      0.7207117  0.8769717  0.06994063  0.02450628 
  0.01    0.05      1.5168857  0.6435177  0.11013983  0.07875588 
  0.01    0.10      1.1324481  0.7671388  0.07499369  0.04771971 
  0.01    0.15      0.9061843  0.8241043  0.05601707  0.02997353 
  0.01    0.20      0.7855269  0.8571170  0.04929439  0.02173949 
  0.01    0.25      0.7296380  0.8733531  0.04166558  0.02066970 
  0.01    0.30      0.6989522  0.8826020  0.04255257  0.02028148 
  0.01    0.35      0.6866513  0.8863490  0.04212287  0.01967040 
  0.01    0.40      0.6806730  0.8884346  0.03999669  0.01852187 
  0.01    0.45      0.6778780  0.8895285  0.03610764  0.01717676 
  0.01    0.50      0.6760780  0.8902871  0.03307570  0.01620142 
  0.01    0.55      0.6743998  0.8909724  0.03065386  0.01569024 
  0.01    0.60      0.6746777  0.8910026  0.03042481  0.01580700 
  0.01    0.65      0.6765522  0.8904906  0.03177438  0.01642381 
  0.01    0.70      0.6796775  0.8895768  0.03364893  0.01711767 
  0.01    0.75      0.6829651  0.8886182  0.03551058  0.01757998 
  0.01    0.80      0.6862396  0.8876472  0.03719803  0.01791970 
  0.01    0.85      0.6895735  0.8866477  0.03885651  0.01822379 
  0.01    0.90      0.6930103  0.8856210  0.04047457  0.01858065 
  0.01    0.95      0.6968398  0.8844630  0.04181671  0.01895729 
  0.01    1.00      0.7006283  0.8833050  0.04284382  0.01929610 
  0.10    0.05      1.6867967  0.5157969  0.13154407  0.08882307 
  0.10    0.10      1.4058744  0.6954146  0.10735405  0.06584337 
  0.10    0.15      1.1697385  0.7596795  0.08648027  0.04623881 
  0.10    0.20      1.0082617  0.7880698  0.06594126  0.03758966 
  0.10    0.25      0.8950440  0.8218825  0.05827006  0.02812113 
  0.10    0.30      0.8193443  0.8435444  0.05167792  0.02222192 
  0.10    0.35      0.7744593  0.8570276  0.04722049  0.02081488 
  0.10    0.40      0.7519611  0.8644826  0.04182081  0.01957350 
  0.10    0.45      0.7343282  0.8710631  0.03806132  0.01874198 
  0.10    0.50      0.7245543  0.8750318  0.03539926  0.01842909 
  0.10    0.55      0.7180823  0.8778937  0.03288742  0.01794844 
  0.10    0.60      0.7137901  0.8799906  0.03184857  0.01756183 
  0.10    0.65      0.7110967  0.8815343  0.03100037  0.01695475 
  0.10    0.70      0.7104058  0.8823940  0.02973462  0.01635597 
  0.10    0.75      0.7103284  0.8829674  0.02952904  0.01597719 
  0.10    0.80      0.7097899  0.8836319  0.03000022  0.01578241 
  0.10    0.85      0.7093246  0.8842290  0.03064013  0.01567030 
  0.10    0.90      0.7094949  0.8845954  0.03109508  0.01554030 
  0.10    0.95      0.7094181  0.8849823  0.03169989  0.01554197 
  0.10    1.00      0.7091432  0.8853885  0.03214610  0.01562886 

RMSE was used to select the optimal model using  the smallest value.
The final values used for the model were fraction = 0.15 and lambda = 0. 

> plot(enetTune)

> testResults$Enet <- predict(enetTune, solTestXtrans)

> ################################################################################
> ### Session Information
> 
> sessionInfo()
R version 3.1.3 (2015-03-09)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.3 (Yosemite)

locale:
[1] C

attached base packages:
[1] parallel  tools     stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] elasticnet_1.1                  lars_1.2                       
 [3] pls_2.4-3                       kernlab_0.9-20                 
 [5] corrplot_0.73                   ellipse_0.3-8                  
 [7] e1071_1.6-4                     earth_4.2.0                    
 [9] plotrix_3.5-11                  plotmo_2.2.1                   
[11] doMC_1.3.3                      iterators_1.0.7                
[13] foreach_1.4.2                   AppliedPredictiveModeling_1.1-6
[15] caret_6.0-41                    ggplot2_1.0.1                  
[17] lattice_0.20-31                

loaded via a namespace (and not attached):
 [1] BradleyTerry2_1.0-6 CORElearn_0.9.45    MASS_7.3-40        
 [4] Matrix_1.1-5        Rcpp_0.11.5         SparseM_1.6        
 [7] brglm_0.5-9         car_2.0-25          class_7.3-12       
[10] cluster_2.0.1       codetools_0.2-10    colorspace_1.2-6   
[13] compiler_3.1.3      digest_0.6.8        grid_3.1.3         
[16] gtable_0.1.2        gtools_3.4.1        lme4_1.1-7         
[19] mgcv_1.8-4          minqa_1.2.4         munsell_0.4.2      
[22] nlme_3.1-120        nloptr_1.0.4        nnet_7.3-9         
[25] pbkrtest_0.4-2      plyr_1.8.1          proto_0.3-10       
[28] quantreg_5.11       reshape2_1.4.1      rpart_4.1-9        
[31] scales_0.2.4        splines_3.1.3       stringr_0.6.2      

> ### q("no")
> 
> 
> 
In [65]:
%%R

### Section 6.1 Case Study: Quantitative Structure-Activity
### Relationship Modeling

library(AppliedPredictiveModeling)
data(solubility)

library(lattice)

### Some initial plots of the data
print(
xyplot(solTrainY ~ solTrainX$MolWeight, type = c("p", "g"),
       ylab = "Solubility (log)",
       main = "(a)",
       xlab = "Molecular Weight")
)
print(
xyplot(solTrainY ~ solTrainX$NumRotBonds, type = c("p", "g"),
       ylab = "Solubility (log)",
       xlab = "Number of Rotatable Bonds")
)
print(
bwplot(solTrainY ~ ifelse(solTrainX[,100] == 1,
                          "structure present",
                          "structure absent"),
       ylab = "Solubility (log)",
       main = "(b)",
       horizontal = FALSE)
)
In [127]:
%%R

### Find the columns that are not fingerprints (i.e. the continuous
### predictors). grep returns an integer vector giving the positions
### of the column names that contain the pattern "FP".

notFingerprints <- grep("FP", names(solTrainXtrans))

library(caret)
print(
featurePlot(solTrainXtrans[, -notFingerprints],
            solTrainY,
            between = list(x = 1, y = 1),
            type = c("g", "p", "smooth"),
            labels = rep("", 2))
)
In [128]:
%%R

library(corrplot)

### We used the full namespace to call this function because the pls
### package (also used in this chapter) has a function with the same
### name.

corrplot::corrplot(cor(solTrainXtrans[, -notFingerprints]),
                   order = "hclust",
                   tl.cex = .8)
In [67]:
%%R

### Section 6.2 Linear Regression

### Create a control function that will be used across models. We
### create the fold assignments explicitly instead of relying on the
### random number seed being set to identical values.

set.seed(100)
indx <- createFolds(solTrainY, returnTrain = TRUE)
ctrl <- trainControl(method = "cv", index = indx)

### Linear regression model with all of the predictors. This will
### produce warnings that a 'rank-deficient fit may be misleading':
### the predictors are so highly correlated that the design matrix is
### effectively rank deficient.


set.seed(100)
lmTune0 <- train(x = solTrainXtrans, y = solTrainY,
                 method = "lm",
                 trControl = ctrl)
print(
lmTune0
)
### And another using a set of predictors reduced by unsupervised
### filtering. We apply a filter to reduce extreme between-predictor
### correlations. Note the lack of warnings.

tooHigh <- findCorrelation(cor(solTrainXtrans), .9)
trainXfiltered <- solTrainXtrans[, -tooHigh]
testXfiltered  <-  solTestXtrans[, -tooHigh]

set.seed(100)
lmTune <- train(x = trainXfiltered, y = solTrainY,
                method = "lm",
                trControl = ctrl)
print(
lmTune
)
### Save the test set results in a data frame
testResults <- data.frame(obs = solTestY,
                          Linear_Regression = predict(lmTune, testXfiltered))
Linear Regression 

951 samples
228 predictors

No pre-processing
Resampling: Cross-Validated (10 fold) 

Summary of sample sizes: 856, 857, 855, 856, 856, 855, ... 

Resampling results

  RMSE       Rsquared   RMSE SD     Rsquared SD
  0.7210355  0.8768359  0.06998223  0.02467069 

 
Linear Regression 

951 samples
190 predictors

No pre-processing
Resampling: Cross-Validated (10 fold) 

Summary of sample sizes: 856, 857, 855, 856, 856, 855, ... 

Resampling results

  RMSE       Rsquared   RMSE SD     Rsquared SD
  0.7113935  0.8793396  0.06320545  0.02434305 
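The unsupervised filter above (findCorrelation with a 0.9 cutoff) removes one predictor from each highly correlated pair before refitting, which is what eliminates the rank-deficiency warnings. A rough numpy sketch of the idea — a greedy analogue, not caret's exact algorithm — looks like this:

```python
import numpy as np

def drop_high_corr(X, cutoff=0.9):
    """Repeatedly remove one column from the most correlated remaining
    pair until no absolute pairwise correlation exceeds the cutoff.
    A rough analogue of caret::findCorrelation, not its exact rule."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    np.fill_diagonal(corr, 0.0)
    keep = list(range(X.shape[1]))
    while len(keep) > 1:
        sub = corr[np.ix_(keep, keep)]
        i, j = np.unravel_index(np.argmax(sub), sub.shape)
        if sub[i, j] <= cutoff:
            break
        # Drop whichever member of the pair is, on average, more
        # correlated with the other predictors (caret uses a similar
        # mean-correlation heuristic).
        drop = keep[i] if sub[i].mean() >= sub[j].mean() else keep[j]
        keep.remove(drop)
    return keep

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:, 2] = X[:, 0] + 0.01 * rng.normal(size=200)  # near-duplicate of column 0
print(drop_high_corr(X))  # one of columns {0, 2} is removed
```

Like the R code, the returned indices would then be used to subset both the training and test predictors.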

 
In [69]:
%%R

### Section 6.3 Partial Least Squares

## Run PLS and PCR on solubility data and compare results
set.seed(100)
plsTune <- train(x = solTrainXtrans, y = solTrainY,
                 method = "pls",
                 tuneGrid = expand.grid(ncomp = 1:20),
                 trControl = ctrl)
print(
plsTune
)

testResults$PLS <- predict(plsTune, solTestXtrans)

set.seed(100)
pcrTune <- train(x = solTrainXtrans, y = solTrainY,
                 method = "pcr",
                 tuneGrid = expand.grid(ncomp = 1:35),
                 trControl = ctrl)
print(
pcrTune
)
Partial Least Squares 

951 samples
228 predictors

No pre-processing
Resampling: Cross-Validated (10 fold) 

Summary of sample sizes: 856, 857, 855, 856, 856, 855, ... 

Resampling results across tuning parameters:

  ncomp  RMSE       Rsquared   RMSE SD     Rsquared SD
   1     1.7543811  0.2630495  0.08396462  0.06500848 
   2     1.2720647  0.6128490  0.07938883  0.05345622 
   3     1.0373646  0.7432147  0.07155432  0.02761174 
   4     0.8370618  0.8317217  0.05615036  0.02574808 
   5     0.7458318  0.8660461  0.03778846  0.01932122 
   6     0.7106591  0.8779019  0.03432693  0.02281696 
   7     0.6921293  0.8841448  0.03794937  0.02403533 
   8     0.6908481  0.8851647  0.03282238  0.01967729 
   9     0.6828771  0.8877056  0.02910576  0.01851863 
  10     0.6824521  0.8879195  0.03050242  0.01870212 
  11     0.6826719  0.8878955  0.02914169  0.01953986 
  12     0.6847473  0.8872488  0.03726823  0.01936983 
  13     0.6836698  0.8875568  0.03972887  0.01935437 
  14     0.6856134  0.8871389  0.03984337  0.01855409 
  15     0.6867190  0.8869351  0.04224044  0.01944079 
  16     0.6860797  0.8872705  0.04359318  0.02079411 
  17     0.6881636  0.8866078  0.04626247  0.02130103 
  18     0.6926077  0.8853743  0.04810637  0.02213141 
  19     0.6943936  0.8848611  0.04858541  0.02206531 
  20     0.6977396  0.8837453  0.05295825  0.02247232 

RMSE was used to select the optimal model using  the smallest value.
The final value used for the model was ncomp = 10. 
Principal Component Analysis 

951 samples
228 predictors

No pre-processing
Resampling: Cross-Validated (10 fold) 

Summary of sample sizes: 856, 857, 855, 856, 856, 855, ... 

Resampling results across tuning parameters:

  ncomp  RMSE       Rsquared    RMSE SD     Rsquared SD
   1     1.9778920  0.06590758  0.11043847  0.03465612 
   2     1.6379400  0.36202127  0.09825075  0.08480717 
   3     1.3655645  0.55546442  0.09395858  0.04528156 
   4     1.3715028  0.55157507  0.09810878  0.04757889 
   5     1.3415864  0.57099834  0.10467614  0.06166222 
   6     1.2081745  0.64973828  0.08788513  0.06148380 
   7     1.1818622  0.66578017  0.10108519  0.06050609 
   8     1.1452119  0.68759737  0.07782801  0.04078188 
   9     1.0495852  0.73655117  0.08201882  0.03697880 
  10     1.0063822  0.75723962  0.09589129  0.04169283 
  11     0.9723334  0.77443568  0.07775156  0.02843482 
  12     0.9692845  0.77566291  0.07887512  0.02905775 
  13     0.9526792  0.78316647  0.07637597  0.02724077 
  14     0.9396590  0.78895459  0.07056722  0.02444445 
  15     0.9419390  0.78796957  0.06837934  0.02414867 
  16     0.8695211  0.81842614  0.04668856  0.02511778 
  17     0.8699482  0.81825536  0.04575858  0.02485892 
  18     0.8719274  0.81723654  0.04753794  0.02576886 
  19     0.8695726  0.81824845  0.04727016  0.02659831 
  20     0.8682556  0.81894961  0.04730875  0.02681389 
  21     0.8096228  0.84189134  0.04576547  0.02447005 
  22     0.8122517  0.84082141  0.04477924  0.02426518 
  23     0.8093641  0.84200427  0.04457044  0.02513324 
  24     0.8096163  0.84210474  0.04011203  0.02327652 
  25     0.8095766  0.84208293  0.03900307  0.02355872 
  26     0.8049366  0.84421798  0.03676154  0.02129394 
  27     0.8039803  0.84465744  0.03378393  0.02036649 
  28     0.8056953  0.84397657  0.03395966  0.02100737 
  29     0.7863312  0.85146390  0.03603401  0.01889728 
  30     0.7819408  0.85271068  0.03068473  0.02057117 
  31     0.7795830  0.85355495  0.02832846  0.02096832 
  32     0.7757032  0.85503975  0.03571378  0.02166955 
  33     0.7395733  0.86853408  0.03063334  0.01813624 
  34     0.7327021  0.87065692  0.03102043  0.02117680 
  35     0.7307134  0.87142813  0.03570471  0.02195190 

RMSE was used to select the optimal model using  the smallest value.
The final value used for the model was ncomp = 35. 
In [70]:
%%R

plsResamples <- plsTune$results
plsResamples$Model <- "PLS"
pcrResamples <- pcrTune$results
pcrResamples$Model <- "PCR"
plsPlotData <- rbind(plsResamples, pcrResamples)

print(
xyplot(RMSE ~ ncomp,
       data = plsPlotData,
       #aspect = 1,
       xlab = "# Components",
       ylab = "RMSE (Cross-Validation)",
       auto.key = list(columns = 2),
       groups = Model,
       type = c("o", "g"))
)

plsImp <- varImp(plsTune, scale = FALSE)
print(
plot(plsImp, top = 25, scales = list(y = list(cex = .95)))
)
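For reference, the RMSE and Rsquared columns in the resampling tables are averages over the 10 cross-validation folds, and the SD columns are the standard deviations across those folds. caret's defaults are root mean squared error and the squared correlation between the observed and predicted values:

```latex
\mathrm{RMSE} \;=\; \sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \hat{y}_i\bigr)^{2}},
\qquad
R^{2} \;=\; \operatorname{cor}\!\bigl(y,\hat{y}\bigr)^{2}
```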
In [71]:
%%R

### Section 6.4 Penalized Models

## The text used the elasticnet package to obtain a ridge regression
## model. caret now also exposes a dedicated "ridge" method for this.

ridgeGrid <- expand.grid(lambda = seq(0, .1, length = 15))

set.seed(100)
ridgeTune <- train(x = solTrainXtrans, y = solTrainY,
                   method = "ridge",
                   tuneGrid = ridgeGrid,
                   trControl = ctrl,
                   preProc = c("center", "scale"))
print(
ridgeTune
)

print(update(plot(ridgeTune), xlab = "Penalty"))
Loading required package: elasticnet
Loading required package: lars
Loaded lars 1.2

Ridge Regression 

951 samples
228 predictors

Pre-processing: centered, scaled 
Resampling: Cross-Validated (10 fold) 

Summary of sample sizes: 856, 857, 855, 856, 856, 855, ... 

Resampling results across tuning parameters:

  lambda       RMSE       Rsquared   RMSE SD     Rsquared SD
  0.000000000  0.7207117  0.8769717  0.06994063  0.02450628 
  0.007142857  0.7047552  0.8818659  0.04495581  0.01988253 
  0.014285714  0.6964731  0.8847911  0.04051497  0.01867276 
  0.021428571  0.6925923  0.8862699  0.03781419  0.01797165 
  0.028571429  0.6908607  0.8870609  0.03593594  0.01748178 
  0.035714286  0.6904220  0.8874561  0.03457159  0.01710886 
  0.042857143  0.6908548  0.8875998  0.03357310  0.01681167 
  0.050000000  0.6919207  0.8875741  0.03285297  0.01656815 
  0.057142857  0.6934783  0.8874278  0.03234969  0.01636278 
  0.064285714  0.6954114  0.8872009  0.03202921  0.01619286 
  0.071428571  0.6976723  0.8869096  0.03185067  0.01604581 
  0.078571429  0.7002069  0.8865723  0.03179153  0.01591906 
  0.085714286  0.7029801  0.8862009  0.03183151  0.01580906 
  0.092857143  0.7059656  0.8858041  0.03195417  0.01571305 
  0.100000000  0.7091432  0.8853885  0.03214610  0.01562886 

RMSE was used to select the optimal model using  the smallest value.
The final value used for the model was lambda = 0.03571429. 
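The penalized models in this section minimize least-squares criteria augmented with coefficient penalties: ridge regression adds a squared (L2) penalty, and the elastic net combines L2 and L1 penalties,

```latex
\hat{\beta}^{\text{ridge}}
  = \operatorname*{arg\,min}_{\beta}\;
    \sum_{i=1}^{n}\Bigl(y_i - \sum_{j=1}^{P} x_{ij}\beta_j\Bigr)^{2}
    + \lambda \sum_{j=1}^{P} \beta_j^{2},

\qquad
\hat{\beta}^{\text{enet}}
  = \operatorname*{arg\,min}_{\beta}\;
    \sum_{i=1}^{n}\Bigl(y_i - \sum_{j=1}^{P} x_{ij}\beta_j\Bigr)^{2}
    + \lambda_2 \sum_{j=1}^{P} \beta_j^{2}
    + \lambda_1 \sum_{j=1}^{P} \lvert\beta_j\rvert .
```

As we understand the elasticnet package's parameterization, its lambda is the ridge penalty λ₂, while fraction controls the L1 penalty indirectly as the fraction of the full-solution L1 norm at which the coefficient path is evaluated: lambda = 0 traces out the lasso, and fraction = 1 removes the L1 constraint entirely (which is why lambda = 0, fraction = 1 reproduces the ordinary least squares RMSE).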
In [80]:
%%R

enetGrid <- expand.grid(lambda = c(0, 0.01, .1),
                        fraction = seq(.05, 1, length = 20))
set.seed(100)
enetTune <- train(x = solTrainXtrans, y = solTrainY,
                  method = "enet",
                  tuneGrid = enetGrid,
                  trControl = ctrl,
                  preProc = c("center", "scale"))
print(
enetTune
)

print(
plot(enetTune)
)
testResults$Enet <- predict(enetTune, solTestXtrans)
Elasticnet 

951 samples
228 predictors

Pre-processing: centered, scaled 
Resampling: Cross-Validated (10 fold) 

Summary of sample sizes: 856, 857, 855, 856, 856, 855, ... 

Resampling results across tuning parameters:

  lambda  fraction  RMSE       Rsquared   RMSE SD     Rsquared SD
  0.00    0.05      0.8713747  0.8337289  0.03816148  0.02737681 
  0.00    0.10      0.6882637  0.8858786  0.04298815  0.02064030 
  0.00    0.15      0.6729264  0.8907993  0.03942228  0.01837582 
  0.00    0.20      0.6754697  0.8903865  0.03807506  0.01760700 
  0.00    0.25      0.6879252  0.8865202  0.04383623  0.01946378 
  0.00    0.30      0.6971062  0.8836414  0.04812788  0.02058289 
  0.00    0.35      0.7062274  0.8808469  0.05191262  0.02155822 
  0.00    0.40      0.7125900  0.8788942  0.05345207  0.02192952 
  0.00    0.45      0.7138742  0.8785588  0.05342746  0.02178996 
  0.00    0.50      0.7141235  0.8785622  0.05461747  0.02183522 
  0.00    0.55      0.7144669  0.8784961  0.05583323  0.02211744 
  0.00    0.60      0.7140532  0.8786593  0.05739702  0.02234513 
  0.00    0.65      0.7140599  0.8786880  0.05941448  0.02265512 
  0.00    0.70      0.7145464  0.8785744  0.06116481  0.02298579 
  0.00    0.75      0.7151011  0.8784348  0.06289926  0.02335653 
  0.00    0.80      0.7158067  0.8782629  0.06453350  0.02366829 
  0.00    0.85      0.7167918  0.8780158  0.06564865  0.02383283 
  0.00    0.90      0.7178711  0.8777467  0.06672370  0.02398923 
  0.00    0.95      0.7191448  0.8774055  0.06834509  0.02424302 
  0.00    1.00      0.7207117  0.8769717  0.06994063  0.02450628 
  0.01    0.05      1.5168857  0.6435177  0.11013983  0.07875588 
  0.01    0.10      1.1324481  0.7671388  0.07499369  0.04771971 
  0.01    0.15      0.9061843  0.8241043  0.05601707  0.02997353 
  0.01    0.20      0.7855269  0.8571170  0.04929439  0.02173949 
  0.01    0.25      0.7296380  0.8733531  0.04166558  0.02066970 
  0.01    0.30      0.6989522  0.8826020  0.04255257  0.02028148 
  0.01    0.35      0.6866513  0.8863490  0.04212287  0.01967040 
  0.01    0.40      0.6806730  0.8884346  0.03999669  0.01852187 
  0.01    0.45      0.6778780  0.8895285  0.03610764  0.01717676 
  0.01    0.50      0.6760780  0.8902871  0.03307570  0.01620142 
  0.01    0.55      0.6743998  0.8909724  0.03065386  0.01569024 
  0.01    0.60      0.6746777  0.8910026  0.03042481  0.01580700 
  0.01    0.65      0.6765522  0.8904906  0.03177438  0.01642381 
  0.01    0.70      0.6796775  0.8895768  0.03364893  0.01711767 
  0.01    0.75      0.6829651  0.8886182  0.03551058  0.01757998 
  0.01    0.80      0.6862396  0.8876472  0.03719803  0.01791970 
  0.01    0.85      0.6895735  0.8866477  0.03885651  0.01822379 
  0.01    0.90      0.6930103  0.8856210  0.04047457  0.01858065 
  0.01    0.95      0.6968398  0.8844630  0.04181671  0.01895729 
  0.01    1.00      0.7006283  0.8833050  0.04284382  0.01929610 
  0.10    0.05      1.6867967  0.5157969  0.13154407  0.08882307 
  0.10    0.10      1.4058744  0.6954146  0.10735405  0.06584337 
  0.10    0.15      1.1697385  0.7596795  0.08648027  0.04623881 
  0.10    0.20      1.0082617  0.7880698  0.06594126  0.03758966 
  0.10    0.25      0.8950440  0.8218825  0.05827006  0.02812113 
  0.10    0.30      0.8193443  0.8435444  0.05167792  0.02222192 
  0.10    0.35      0.7744593  0.8570276  0.04722049  0.02081488 
  0.10    0.40      0.7519611  0.8644826  0.04182081  0.01957350 
  0.10    0.45      0.7343282  0.8710631  0.03806132  0.01874198 
  0.10    0.50      0.7245543  0.8750318  0.03539926  0.01842909 
  0.10    0.55      0.7180823  0.8778937  0.03288742  0.01794844 
  0.10    0.60      0.7137901  0.8799906  0.03184857  0.01756183 
  0.10    0.65      0.7110967  0.8815343  0.03100037  0.01695475 
  0.10    0.70      0.7104058  0.8823940  0.02973462  0.01635597 
  0.10    0.75      0.7103284  0.8829674  0.02952904  0.01597719 
  0.10    0.80      0.7097899  0.8836319  0.03000022  0.01578241 
  0.10    0.85      0.7093246  0.8842290  0.03064013  0.01567030 
  0.10    0.90      0.7094949  0.8845954  0.03109508  0.01554030 
  0.10    0.95      0.7094181  0.8849823  0.03169989  0.01554197 
  0.10    1.00      0.7091432  0.8853885  0.03214610  0.01562886 

RMSE was used to select the optimal model using  the smallest value.
The final values used for the model were fraction = 0.15 and lambda = 0. 
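The tune-over-a-grid, pick-the-smallest-RMSE workflow above carries over directly to scikit-learn, pairing ElasticNet with GridSearchCV. One caveat: sklearn parameterizes the model with an overall penalty `alpha` and an L1/L2 mixing proportion `l1_ratio`, not the (lambda, fraction) pair of the enet package, so the grids are not interchangeable. A sketch on synthetic data (not the solubility set):

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic sparse regression problem (stand-in for the solubility data).
rng = np.random.default_rng(100)
X = rng.normal(size=(200, 20))
beta = np.zeros(20)
beta[:3] = [3.0, -2.0, 1.0]
y = X @ beta + 0.5 * rng.normal(size=200)

# Center/scale inside the pipeline, mirroring preProc = c("center", "scale").
model = make_pipeline(StandardScaler(), ElasticNet(max_iter=10000))
grid = {"elasticnet__alpha": [0.001, 0.01, 0.1, 1.0],
        "elasticnet__l1_ratio": [0.1, 0.5, 0.9, 1.0]}
search = GridSearchCV(model, grid,
                      cv=KFold(n_splits=10, shuffle=True, random_state=100),
                      scoring="neg_root_mean_squared_error")
search.fit(X, y)
print(search.best_params_, "CV RMSE:", round(-search.best_score_, 3))
```

Scaling inside the pipeline (rather than before the split) keeps the preprocessing inside each resampling loop, which is also what caret's preProc argument does.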
In [75]:
%%R

showChapterScript(7)
NULL
In [76]:
%%R

showChapterOutput(7)
NULL
In [59]:
%%R -w 600 -h 600

## runChapterScript(7)

##        user     system    elapsed 
##  112106.723    188.979  12272.168
NULL
In [77]:
%%R

showChapterScript(8)
NULL
In [78]:
%%R

showChapterOutput(8)
NULL
In [62]:
%%R -w 600 -h 600

##  runChapterScript(8)

##       user    system   elapsed 
##  21280.849   500.609  6798.887
NULL
In [79]:
%%R

showChapterScript(10)
NULL
In [64]:
%%R

showChapterOutput(10)
R Information
R version 3.0.1 (2013-05-16) -- "Good Sport"
Copyright (C) 2013 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin10.8.0 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> ################################################################################
> ### R code from Applied Predictive Modeling (2013) by Kuhn and Johnson.
> ### Copyright 2013 Kuhn and Johnson
> ### Web Page: http://www.appliedpredictivemodeling.com
> ### Contact: Max Kuhn (mxkuhn@gmail.com)
> ###
> ### Chapter 10: Case Study: Compressive Strength of Concrete Mixtures
> ###
> ### Required packages: AppliedPredictiveModeling, caret, Cubist, doMC (optional),
> ###                    earth, elasticnet, gbm, ipred, lattice, nnet, party, pls,
> ###                    randomForests, rpart, RWeka      
> ###
> ### Data used: The concrete from the AppliedPredictiveModeling package
> ###
> ### Notes: 
> ### 1) This code is provided without warranty.
> ###
> ### 2) This code should help the user reproduce the results in the
> ### text. There will be differences between this code and what is in
> ### the computing sections. For example, the computing sections show
> ### how the source functions work (e.g. randomForest() or plsr()),
> ### which were not directly used when creating the book. Also, there may be 
> ### syntax differences that occur over time as packages evolve. These files 
> ### will reflect those changes.
> ###
> ### 3) In some cases, the calculations in the book were run in 
> ### parallel. The sub-processes may reset the random number seed.
> ### Your results may slightly vary.
> ###
> ################################################################################
> 
> ################################################################################
> ### Load the data and plot the data
> 
> library(AppliedPredictiveModeling)
> data(concrete)
> 
> library(caret)
Loading required package: lattice
Loading required package: ggplot2
> library(plyr)
> 
> featurePlot(concrete[, -9], concrete$CompressiveStrength,
+             between = list(x = 1, y = 1),
+             type = c("g", "p", "smooth"))
> 
> 
> ################################################################################
> ### Section 10.1 Model Building Strategy
> ### There are replicated mixtures, so take the average per mixture
>             
> averaged <- ddply(mixtures,
+                   .(Cement, BlastFurnaceSlag, FlyAsh, Water, 
+                     Superplasticizer, CoarseAggregate, 
+                     FineAggregate, Age),
+                   function(x) c(CompressiveStrength = 
+                     mean(x$CompressiveStrength)))
> 
> ### Split the data and create a control object for train()
> 
> set.seed(975)
> inTrain <- createDataPartition(averaged$CompressiveStrength, p = 3/4)[[1]]
> training <- averaged[ inTrain,]
> testing  <- averaged[-inTrain,]
> 
> ctrl <- trainControl(method = "repeatedcv", repeats = 5, number = 10)
> 
> ### Create a model formula that can be used repeatedly
> 
> modForm <- paste("CompressiveStrength ~ (.)^2 + I(Cement^2) + I(BlastFurnaceSlag^2) +",
+                  "I(FlyAsh^2)  + I(Water^2) + I(Superplasticizer^2)  +",
+                  "I(CoarseAggregate^2) +  I(FineAggregate^2) + I(Age^2)")
> modForm <- as.formula(modForm)
> 
> ### Fit the various models
> 
> ### Optional: parallel processing can be used via the 'do' packages,
> ### such as doMC, doMPI etc. We used doMC (not on Windows) to speed
> ### up the computations.
> 
> ### WARNING: Be aware of how much memory is needed to parallel
> ### process. It can very quickly overwhelm the available hardware. The
> ### estimate of the median memory usage (VSIZE = total memory size) 
> ### was 2800M for a core although the M5 calculations require about 
> ### 3700M without parallel processing. 
> 
> ### WARNING 2: The RWeka package does not work well with some forms of
> ### parallel processing, such as multicore (i.e. doMC). 
> 
> library(doMC)
Loading required package: foreach
Loading required package: iterators
Loading required package: parallel
> registerDoMC(14)
> 
> set.seed(669)
> lmFit <- train(modForm, data = training,
+                method = "lm",
+                trControl = ctrl)
> 
> set.seed(669)
> plsFit <- train(modForm, data = training,
+                 method = "pls",
+                 preProc = c("center", "scale"),
+                 tuneLength = 15,
+                 trControl = ctrl)
Loading required package: pls

Attaching package: ‘pls’

The following object is masked from ‘package:caret’:

    R2

The following object is masked from ‘package:stats’:

    loadings

> 
> lassoGrid <- expand.grid(lambda = c(0, .001, .01, .1), 
+                          fraction = seq(0.05, 1, length = 20))
> set.seed(669)
> lassoFit <- train(modForm, data = training,
+                   method = "enet",
+                   preProc = c("center", "scale"),
+                   tuneGrid = lassoGrid,
+                   trControl = ctrl)
Loading required package: elasticnet
Loading required package: lars
Loaded lars 1.1

> 
> set.seed(669)
> earthFit <- train(CompressiveStrength ~ ., data = training,
+                   method = "earth",
+                   tuneGrid = expand.grid(degree = 1, 
+                                          nprune = 2:25),
+                   trControl = ctrl)
Loading required package: earth
Loading required package: leaps
Loading required package: plotmo
Loading required package: plotrix
> 
> set.seed(669)
> svmRFit <- train(CompressiveStrength ~ ., data = training,
+                  method = "svmRadial",
+                  tuneLength = 15,
+                  preProc = c("center", "scale"),
+                  trControl = ctrl)
Loading required package: kernlab
> 
> 
> nnetGrid <- expand.grid(decay = c(0.001, .01, .1), 
+                         size = seq(1, 27, by = 2), 
+                         bag = FALSE)
> set.seed(669)
> nnetFit <- train(CompressiveStrength ~ .,
+                  data = training,
+                  method = "avNNet",
+                  tuneGrid = nnetGrid,
+                  preProc = c("center", "scale"),
+                  linout = TRUE,
+                  trace = FALSE,
+                  maxit = 1000,
+                  allowParallel = FALSE,
+                  trControl = ctrl)
Loading required package: nnet
> 
> set.seed(669)
> rpartFit <- train(CompressiveStrength ~ .,
+                   data = training,
+                   method = "rpart",
+                   tuneLength = 30,
+                   trControl = ctrl)
Loading required package: rpart
Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,  :
  There were missing values in resampled performance measures.
> 
> set.seed(669)
> treebagFit <- train(CompressiveStrength ~ .,
+                     data = training,
+                     method = "treebag",
+                     trControl = ctrl)
Loading required package: ipred
Loading required package: MASS
Loading required package: survival
Loading required package: splines

Attaching package: ‘survival’

The following object is masked from ‘package:caret’:

    cluster

Loading required package: class
Loading required package: prodlim
KernSmooth 2.23 loaded
Copyright M. P. Wand 1997-2009
> 
> set.seed(669)
> ctreeFit <- train(CompressiveStrength ~ .,
+                   data = training,
+                   method = "ctree",
+                   tuneLength = 10,
+                   trControl = ctrl)
Loading required package: party
Loading required package: grid
Loading required package: modeltools
Loading required package: stats4

Attaching package: ‘modeltools’

The following object is masked from ‘package:kernlab’:

    prior

The following object is masked from ‘package:plyr’:

    empty

Loading required package: coin
Loading required package: mvtnorm
Loading required package: zoo

Attaching package: ‘zoo’

The following object is masked from ‘package:base’:

    as.Date, as.Date.numeric

Loading required package: sandwich
Loading required package: strucchange
Loading required package: vcd
Loading required package: colorspace
> 
> set.seed(669)
> rfFit <- train(CompressiveStrength ~ .,
+                data = training,
+                method = "rf",
+                tuneLength = 10,
+                ntrees = 1000,
+                importance = TRUE,
+                trControl = ctrl)
Loading required package: randomForest
randomForest 4.6-7
Type rfNews() to see new features/changes/bug fixes.
note: only 7 unique complexity parameters in default grid. Truncating the grid to 7 .

> 
> 
> gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
+                        n.trees = seq(100, 1000, by = 50),
+                        shrinkage = c(0.01, 0.1))
> set.seed(669)
> gbmFit <- train(CompressiveStrength ~ .,
+                 data = training,
+                 method = "gbm",
+                 tuneGrid = gbmGrid,
+                 verbose = FALSE,
+                 trControl = ctrl)
Loading required package: gbm
Loaded gbm 2.1
> 
> 
> cbGrid <- expand.grid(committees = c(1, 5, 10, 50, 75, 100), 
+                       neighbors = c(0, 1, 3, 5, 7, 9))
> set.seed(669)
> cbFit <- train(CompressiveStrength ~ .,
+                data = training,
+                method = "cubist",
+                tuneGrid = cbGrid,
+                trControl = ctrl)
Loading required package: Cubist
Loading required package: reshape2
> 
> ### Turn off the parallel processing to use RWeka. 
> registerDoSEQ()
> 
> 
> set.seed(669)
> mtFit <- train(CompressiveStrength ~ .,
+                data = training,
+                method = "M5",
+                trControl = ctrl)
Loading required package: RWeka
Warning message:
In train.default(x, y, weights = w, ...) :
  Models using Weka will not work with parallel processing with multicore/doMC
> 
> ################################################################################
> ### Section 10.2 Model Performance
> 
> ### Collect the resampling statistics across all the models
> 
> rs <- resamples(list("Linear Reg" = lmFit,
+                      PLS = plsFit,
+                      "Elastic Net" = lassoFit, 
+                      MARS = earthFit,
+                      SVM = svmRFit, 
+                      "Neural Networks" = nnetFit,
+                      CART = rpartFit, 
+                      "Cond Inf Tree" = ctreeFit,
+                      "Bagged Tree" = treebagFit,
+                      "Boosted Tree" = gbmFit,
+                      "Random Forest" = rfFit,
+                      Cubist = cbFit))
> 
> #parallelPlot(rs)
> #parallelPlot(rs, metric = "Rsquared")
> 
> ### Get the test set results across several models
> 
> nnetPred <- predict(nnetFit, testing)
> gbmPred <- predict(gbmFit, testing)
> cbPred <- predict(cbFit, testing)
> 
> testResults <- rbind(postResample(nnetPred, testing$CompressiveStrength),
+                      postResample(gbmPred, testing$CompressiveStrength),
+                      postResample(cbPred, testing$CompressiveStrength))
> testResults <- as.data.frame(testResults)
> testResults$Model <- c("Neural Networks", "Boosted Tree", "Cubist")
> testResults <- testResults[order(testResults$RMSE),]
> 
> ################################################################################
> ### Section 10.3 Optimizing Compressive Strength
> 
> library(proxy)

Attaching package: ‘proxy’

The following objects are masked from ‘package:stats’:

    as.dist, dist

> 
> ### Create a function to maximize compressive strength* while keeping
> ### the predictor values as mixtures. Water (in x[7]) is used as the 
> ### 'slack variable'. 
> 
> ### * We are actually minimizing the negative compressive strength
> 
> modelPrediction <- function(x, mod, limit = 2500)
+ {
+   if(x[1] < 0 | x[1] > 1) return(10^38)
+   if(x[2] < 0 | x[2] > 1) return(10^38)
+   if(x[3] < 0 | x[3] > 1) return(10^38)
+   if(x[4] < 0 | x[4] > 1) return(10^38)
+   if(x[5] < 0 | x[5] > 1) return(10^38)
+   if(x[6] < 0 | x[6] > 1) return(10^38)
+   
+   x <- c(x, 1 - sum(x))
+   
+   if(x[7] < 0.05) return(10^38)
+   
+   tmp <- as.data.frame(t(x))
+   names(tmp) <- c('Cement','BlastFurnaceSlag','FlyAsh',
+                   'Superplasticizer','CoarseAggregate',
+                   'FineAggregate', 'Water')
+   tmp$Age <- 28
+   -predict(mod, tmp)
+ }
> 
> ### Get mixtures at 28 days 
> subTrain <- subset(training, Age == 28)
> 
> ### Center and scale the data to use dissimilarity sampling
> pp1 <- preProcess(subTrain[, -(8:9)], c("center", "scale"))
> scaledTrain <- predict(pp1, subTrain[, 1:7])
> 
> ### Randomly select a few mixtures as a starting pool
> 
> set.seed(91)
> startMixture <- sample(1:nrow(subTrain), 1)
> starters <- scaledTrain[startMixture, 1:7]
> pool <- scaledTrain
> index <- maxDissim(starters, pool, 14)
> startPoints <- c(startMixture, index)
> 
> starters <- subTrain[startPoints,1:7]
> startingValues <- starters[, -4]
> 
> ### For each starting mixture, optimize the Cubist model using
> ### a simplex search routine
> 
> cbResults <- startingValues
> cbResults$Water <- NA
> cbResults$Prediction <- NA
> 
> for(i in 1:nrow(cbResults))
+ {
+   results <- optim(unlist(cbResults[i,1:6]),
+                    modelPrediction,
+                    method = "Nelder-Mead",
+                    control=list(maxit=5000),
+                    mod = cbFit)
+   cbResults$Prediction[i] <- -results$value
+   cbResults[i,1:6] <- results$par
+ }
> cbResults$Water <- 1 - apply(cbResults[,1:6], 1, sum)
> cbResults <- subset(cbResults, Prediction > 0 & Water > .02)
> cbResults <- cbResults[order(-cbResults$Prediction),][1:3,]
> cbResults$Model <- "Cubist"
> 
> ### Do the same for the neural network model
> 
> nnetResults <- startingValues
> nnetResults$Water <- NA
> nnetResults$Prediction <- NA
> 
> for(i in 1:nrow(nnetResults))
+ {
+   results <- optim(unlist(nnetResults[i, 1:6]),
+                    modelPrediction,
+                    method = "Nelder-Mead",
+                    control=list(maxit=5000),
+                    mod = nnetFit)
+   nnetResults$Prediction[i] <- -results$value
+   nnetResults[i,1:6] <- results$par
+ }
> nnetResults$Water <- 1 - apply(nnetResults[,1:6], 1, sum)
> nnetResults <- subset(nnetResults, Prediction > 0 & Water > .02)
> nnetResults <- nnetResults[order(-nnetResults$Prediction),][1:3,]
> nnetResults$Model <- "NNet"
> 
> ### Convert the predicted mixtures to PCA space and plot
> 
> pp2 <- preProcess(subTrain[, 1:7], "pca")
> pca1 <- predict(pp2, subTrain[, 1:7])
> pca1$Data <- "Training Set"
> pca1$Data[startPoints] <- "Starting Values"
> pca3 <- predict(pp2, cbResults[, names(subTrain[, 1:7])])
> pca3$Data <- "Cubist"
> pca4 <- predict(pp2, nnetResults[, names(subTrain[, 1:7])])
> pca4$Data <- "Neural Network"
> 
> pcaData <- rbind(pca1, pca3, pca4)
> pcaData$Data <- factor(pcaData$Data,
+                        levels = c("Training Set","Starting Values",
+                                   "Cubist","Neural Network"))
> 
> lim <- extendrange(pcaData[, 1:2])
> 
> xyplot(PC2 ~ PC1, 
+        data = pcaData, 
+        groups = Data,
+        auto.key = list(columns = 2),
+        xlim = lim, 
+        ylim = lim,
+        type = c("g", "p"))
> 
> 
> ################################################################################
> ### Session Information
> 
> sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
 [1] stats4    grid      splines   parallel  stats     graphics  grDevices
 [8] utils     datasets  methods   base     

other attached packages:
 [1] proxy_0.4-9                     RWeka_0.4-15                   
 [3] Cubist_0.0.13                   reshape2_1.2.2                 
 [5] gbm_2.1                         randomForest_4.6-7             
 [7] party_1.0-6                     vcd_1.2-13                     
 [9] colorspace_1.2-1                strucchange_1.4-7              
[11] sandwich_2.2-9                  zoo_1.7-9                      
[13] coin_1.0-21                     mvtnorm_0.9-9994               
[15] modeltools_0.2-19               ipred_0.9-1                    
[17] prodlim_1.3.3                   class_7.3-7                    
[19] survival_2.37-4                 MASS_7.3-26                    
[21] rpart_4.1-1                     nnet_7.3-6                     
[23] kernlab_0.9-16                  earth_3.2-3                    
[25] plotrix_3.4-6                   plotmo_1.3-2                   
[27] leaps_2.9                       elasticnet_1.1                 
[29] lars_1.1                        pls_2.3-0                      
[31] doMC_1.3.0                      iterators_1.0.6                
[33] foreach_1.4.0                   plyr_1.8                       
[35] caret_6.0-22                    ggplot2_0.9.3.1                
[37] lattice_0.20-15                 AppliedPredictiveModeling_1.1-5

loaded via a namespace (and not attached):
 [1] car_2.0-16         codetools_0.2-8    compiler_3.0.1     CORElearn_0.9.41  
 [5] dichromat_2.0-0    digest_0.6.3       gtable_0.1.2       KernSmooth_2.23-10
 [9] labeling_0.1       munsell_0.4        proto_0.3-10       RColorBrewer_1.0-5
[13] rJava_0.9-4        RWekajars_3.7.8-1  scales_0.2.3       stringr_0.6.2     
[17] tools_3.0.1       
> 
> q("no")
> proc.time()
     user    system   elapsed 
20277.196   121.470  4043.395 
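The Section 10.3 search above leans on one trick in `modelPrediction()`: Water is treated as a slack variable (`1 - sum(x)`), and infeasible mixtures return a huge penalty value, so the unconstrained Nelder-Mead search in `optim()` is steered back inside the constraints. A minimal sketch of the same pattern, with a hypothetical quadratic `toyObjective()` standing in for the fitted Cubist model:

```r
## Slack-variable pattern from modelPrediction(), in miniature.
## toyObjective() is made up for illustration: a quadratic with its
## maximum at 0.15 per component, NOT the Cubist fit from the chapter.
toyObjective <- function(x) {
  if (any(x < 0 | x > 1)) return(10^38)  # proportions must be in [0, 1]
  water <- 1 - sum(x)                    # Water is the slack variable
  if (water < 0.05) return(10^38)        # keep a minimum water fraction
  -(1 - sum((x - 0.15)^2))               # maximize => minimize the negative
}

res <- optim(rep(0.1, 6), toyObjective,
             method = "Nelder-Mead", control = list(maxit = 5000))
round(res$par, 2)   # each component near 0.15; Water = 1 - sum ~ 0.10
```

The penalty approach works because Nelder-Mead only compares function values, so discontinuous "walls" around the feasible region are enough to keep the simplex inside it.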
In [65]:
%%R

# Try this only if you are very patient --
# in the APM version of the output file, the run time
# for this script is listed as 5.6 hours.

# Chs 10 and 17 evaluate many different models in case studies.
# To run the Ch.10 script:

VERY_PATIENT = FALSE

if (VERY_PATIENT) {
   current_working_directory = getwd()  # remember current directory

   chapter_code_directory = scriptLocation()

   setwd( chapter_code_directory )
   print(dir())

   print(source("10_Case_Study_Concrete.R", echo=TRUE))

   setwd(current_working_directory)  # return to working directory
}

##       user    system   elapsed 
##  20277.196   121.470  4043.395
In [72]:
%%R

showChapterScript(11)
NULL
In [73]:
%%R

showChapterOutput(11)
NULL
In [129]:
%%R -w 600 -h 600

runChapterScript(11)

##     user  system elapsed 
##   11.120   0.526  11.698
NULL
In [81]:
%%R

### Section 11.1 Class Predictions

library(AppliedPredictiveModeling)

### Simulate some two class data with two predictors
set.seed(975)
training <- quadBoundaryFunc(500)
testing <- quadBoundaryFunc(1000)
testing$class2 <- ifelse(testing$class == "Class1", 1, 0)
testing$ID <- 1:nrow(testing)

### Fit models
library(MASS)
qdaFit <- qda(class ~ X1 + X2, data = training)

library(randomForest)
rfFit <- randomForest(class ~ X1 + X2, data = training, ntree = 2000)

### Predict the test set
testing$qda <- predict(qdaFit, testing)$posterior[,1]
testing$rf <- predict(rfFit, testing, type = "prob")[,1]


### Generate the calibration analysis
library(caret)
calData1 <- calibration(class ~ qda + rf, data = testing, cuts = 10)

### Plot the curve
print(
xyplot(calData1, auto.key = list(columns = 2))
)
randomForest 4.6-10
Type rfNews() to see new features/changes/bug fixes.
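`calibration()` above bins the predicted class probabilities and compares each bin's midpoint to the observed event rate in that bin. A minimal sketch of that computation on simulated data (hypothetical draws, not the `qda`/`rf` predictions from the cell above):

```r
## What the calibration plot shows, computed by hand: bin the predicted
## probabilities and compare each bin midpoint to the observed event rate.
## These data are simulated to be perfectly calibrated by construction.
set.seed(1)
prob <- runif(1000)                      # "predicted" probabilities
obs  <- rbinom(1000, 1, prob)            # events drawn at those probabilities
bins <- cut(prob, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)

calib <- data.frame(midpoint  = seq(0.05, 0.95, by = 0.1),
                    eventRate = as.vector(tapply(obs, bins, mean)))
calib   # eventRate tracks midpoint closely for a well-calibrated model
```

A well-calibrated model produces points near the 45-degree line, which is exactly what the `xyplot()` of the `calibration()` object displays.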
In [82]:
%%R

### To calibrate the data, treat the probabilities as inputs into the
### model

trainProbs <- training
trainProbs$qda <- predict(qdaFit)$posterior[,1]

### These models take the probabilities as inputs and, based on the
### true class, re-calibrate them.
library(klaR)
nbCal <- NaiveBayes(class ~ qda, data = trainProbs, usekernel = TRUE)

### We use relevel() here because glm() models the probability of the
### second factor level.
lrCal <- glm(relevel(class, "Class2") ~ qda, data = trainProbs, family = binomial)

### Now re-predict the test set using the modified class probability
### estimates
testing$qda2 <- predict(nbCal, testing[, "qda", drop = FALSE])$posterior[,1]
testing$qda3 <- predict(lrCal, testing[, "qda", drop = FALSE], type = "response")


### Manipulate the data a bit for pretty plotting
simulatedProbs <- testing[, c("class", "rf", "qda3")]
names(simulatedProbs) <- c("TrueClass", "RandomForestProb", "QDACalibrated")
simulatedProbs$RandomForestClass <-  predict(rfFit, testing)

calData2 <- calibration(class ~ qda + qda2 + qda3, data = testing)
calData2$data$calibModelVar <- as.character(calData2$data$calibModelVar)
calData2$data$calibModelVar <- ifelse(calData2$data$calibModelVar == "qda",
                                      "QDA",
                                      calData2$data$calibModelVar)
calData2$data$calibModelVar <- ifelse(calData2$data$calibModelVar == "qda2",
                                      "Bayesian Calibration",
                                      calData2$data$calibModelVar)

calData2$data$calibModelVar <- ifelse(calData2$data$calibModelVar == "qda3",
                                      "Sigmoidal Calibration",
                                      calData2$data$calibModelVar)

calData2$data$calibModelVar <- factor(calData2$data$calibModelVar,
                                      levels = c("QDA",
                                                 "Bayesian Calibration",
                                                 "Sigmoidal Calibration"))
print(
xyplot(calData2, auto.key = list(columns = 1))
)
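The `relevel()` call above is needed because, with a factor response, `glm(family = binomial)` models the probability of the second factor level, not the first. A tiny self-contained check on hypothetical data:

```r
## With a two-level factor response, glm() treats the first level as
## "failure" and models P(second level). Levels here are c("Bad", "Good"),
## so an intercept-only fit recovers P(Good) = 3/5.
y   <- factor(c("Bad", "Bad", "Good", "Good", "Good"),
              levels = c("Bad", "Good"))
fit <- glm(y ~ 1, family = binomial)
unname(predict(fit, type = "response")[1])   # 0.6 = P(Good)
```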
In [119]:
%%R

## These commands reload GermanCredit, which is modified by this chapter's code and by the Ch.4 code:

detach(package:caret)
library(caret)
data(GermanCredit)

## First, remove near-zero variance predictors, then get rid of a few predictors
## that duplicate values. For example, there are three possible values for the
## housing variable: "Rent", "Own" and "ForFree". So that we don't have linear
## dependencies, we get rid of one of the levels (e.g. "ForFree")

GermanCredit <- GermanCredit[, -nearZeroVar(GermanCredit)]
GermanCredit$CheckingAccountStatus.lt.0 <- NULL
GermanCredit$SavingsAccountBonds.lt.100 <- NULL
GermanCredit$EmploymentDuration.lt.1 <- NULL
GermanCredit$EmploymentDuration.Unemployed <- NULL
GermanCredit$Personal.Male.Married.Widowed <- NULL
GermanCredit$Property.Unknown <- NULL
GermanCredit$Housing.ForFree <- NULL

## Split the data into training (80%) and test sets (20%)
set.seed(100)
inTrain <- createDataPartition(GermanCredit$Class, p = .8)[[1]]
GermanCreditTrain <- GermanCredit[ inTrain, ]
GermanCreditTest  <- GermanCredit[-inTrain, ]

set.seed(1056)
logisticReg <- train(Class ~ .,
                     data = GermanCreditTrain,
                     method = "glm",
                     trControl = trainControl(method = "repeatedcv",
                                              repeats = 5))
print(
logisticReg
)

### Predict the test set
creditResults <- data.frame(obs = GermanCreditTest$Class)
creditResults$prob <- predict(logisticReg, GermanCreditTest, type = "prob")[, "Bad"]
creditResults$pred <- predict(logisticReg, GermanCreditTest)
creditResults$Label <- ifelse(creditResults$obs == "Bad",
                              "True Outcome: Bad Credit",
                              "True Outcome: Good Credit")

### Plot the probability of bad credit
print(
histogram(~prob|Label,
          data = creditResults,
          layout = c(2, 1),
          nint = 20,
          xlab = "Probability of Bad Credit",
          type = "count")
)

### Calculate and plot the calibration curve
creditCalib <- calibration(obs ~ prob, data = creditResults)

print(
xyplot(creditCalib)
)

### Create the confusion matrix from the test set.
print(
confusionMatrix(data = creditResults$pred,
                reference = creditResults$obs)
)
Generalized Linear Model 

800 samples
 41 predictor
  2 classes: 'Bad', 'Good' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times) 

Summary of sample sizes: 720, 720, 720, 720, 720, 720, ... 

Resampling results

  Accuracy  Kappa      Accuracy SD  Kappa SD 
  0.749     0.3647664  0.05162166   0.1218109

 
Confusion Matrix and Statistics

          Reference
Prediction Bad Good
      Bad   24   10
      Good  36  130
                                          
               Accuracy : 0.77            
                 95% CI : (0.7054, 0.8264)
    No Information Rate : 0.7             
    P-Value [Acc > NIR] : 0.0168694       
                                          
                  Kappa : 0.375           
 Mcnemar's Test P-Value : 0.0002278       
                                          
            Sensitivity : 0.4000          
            Specificity : 0.9286          
         Pos Pred Value : 0.7059          
         Neg Pred Value : 0.7831          
             Prevalence : 0.3000          
         Detection Rate : 0.1200          
   Detection Prevalence : 0.1700          
      Balanced Accuracy : 0.6643          
                                          
       'Positive' Class : Bad             
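The statistics printed by `confusionMatrix()` above follow directly from the 2x2 counts, with "Bad" as the positive class. Recomputing a few of them by hand as a check:

```r
## Recompute accuracy, sensitivity, and specificity from the 2x2 table
## printed above (rows = prediction, columns = reference; "Bad" positive).
tab <- matrix(c(24, 36, 10, 130), nrow = 2,
              dimnames = list(Prediction = c("Bad", "Good"),
                              Reference  = c("Bad", "Good")))

accuracy    <- sum(diag(tab)) / sum(tab)                 # (24 + 130)/200 = 0.77
sensitivity <- tab["Bad", "Bad"]   / sum(tab[, "Bad"])   # 24/60   = 0.40
specificity <- tab["Good", "Good"] / sum(tab[, "Good"])  # 130/140 ~ 0.9286
round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity), 4)
```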
                                          
In [120]:
%%R

### ROC curves:

### Like glm(), roc() treats the last level of the factor as the event
### of interest so we use relevel() to change the observed class data

library(pROC)
creditROC <- roc(relevel(creditResults$obs, "Good"), creditResults$prob)

coords(creditROC, "all")[,1:3]

print(
auc(creditROC)
)

print(   
ci.auc(creditROC)
)

### Note the x-axis is reversed
plot(creditROC)

### Old-school:
plot(creditROC, legacy.axes = TRUE)

### Lift charts

creditLift <- lift(obs ~ prob, data = creditResults)

print(
xyplot(creditLift)
)
Area under the curve: 0.775
95% CI: 0.7032-0.8468 (DeLong)
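The `auc()` value above is the area under the ROC curve, which equals the probability that a randomly chosen event receives a higher score than a randomly chosen non-event (the Mann-Whitney statistic). A rank-based sketch of that equivalence, on hypothetical scores rather than the credit predictions:

```r
## AUC via the Mann-Whitney rank statistic: no ROC curve needed.
aucByRanks <- function(score, isEvent) {
  r    <- rank(score)        # ties receive average ranks
  nPos <- sum(isEvent)
  nNeg <- sum(!isEvent)
  (sum(r[isEvent]) - nPos * (nPos + 1) / 2) / (nPos * nNeg)
}

aucByRanks(c(0.1, 0.2, 0.8, 0.9), c(FALSE, FALSE, TRUE, TRUE))  # perfect: 1
set.seed(2)
scores <- c(rnorm(50, mean = 1), rnorm(50, mean = 0))  # events score higher
events <- rep(c(TRUE, FALSE), each = 50)
aucByRanks(scores, events)       # well above the 0.5 no-skill baseline
```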
In [121]:
%%R
summary(GermanCredit)
    Duration        Amount      InstallmentRatePercentage ResidenceDuration
 Min.   : 4.0   Min.   :  250   Min.   :1.000             Min.   :1.000    
 1st Qu.:12.0   1st Qu.: 1366   1st Qu.:2.000             1st Qu.:2.000    
 Median :18.0   Median : 2320   Median :3.000             Median :3.000    
 Mean   :20.9   Mean   : 3271   Mean   :2.973             Mean   :2.845    
 3rd Qu.:24.0   3rd Qu.: 3972   3rd Qu.:4.000             3rd Qu.:4.000    
 Max.   :72.0   Max.   :18424   Max.   :4.000             Max.   :4.000    
      Age        NumberExistingCredits NumberPeopleMaintenance   Telephone    
 Min.   :19.00   Min.   :1.000         Min.   :1.000           Min.   :0.000  
 1st Qu.:27.00   1st Qu.:1.000         1st Qu.:1.000           1st Qu.:0.000  
 Median :33.00   Median :1.000         Median :1.000           Median :1.000  
 Mean   :35.55   Mean   :1.407         Mean   :1.155           Mean   :0.596  
 3rd Qu.:42.00   3rd Qu.:2.000         3rd Qu.:1.000           3rd Qu.:1.000  
 Max.   :75.00   Max.   :4.000         Max.   :2.000           Max.   :1.000  
  Class     CheckingAccountStatus.0.to.200 CheckingAccountStatus.gt.200
 Bad :300   Min.   :0.000                  Min.   :0.000               
 Good:700   1st Qu.:0.000                  1st Qu.:0.000               
            Median :0.000                  Median :0.000               
            Mean   :0.269                  Mean   :0.063               
            3rd Qu.:1.000                  3rd Qu.:0.000               
            Max.   :1.000                  Max.   :1.000               
 CheckingAccountStatus.none CreditHistory.PaidDuly CreditHistory.Delay
 Min.   :0.000              Min.   :0.00           Min.   :0.000      
 1st Qu.:0.000              1st Qu.:0.00           1st Qu.:0.000      
 Median :0.000              Median :1.00           Median :0.000      
 Mean   :0.394              Mean   :0.53           Mean   :0.088      
 3rd Qu.:1.000              3rd Qu.:1.00           3rd Qu.:0.000      
 Max.   :1.000              Max.   :1.00           Max.   :1.000      
 CreditHistory.Critical Purpose.NewCar  Purpose.UsedCar
 Min.   :0.000          Min.   :0.000   Min.   :0.000  
 1st Qu.:0.000          1st Qu.:0.000   1st Qu.:0.000  
 Median :0.000          Median :0.000   Median :0.000  
 Mean   :0.293          Mean   :0.234   Mean   :0.103  
 3rd Qu.:1.000          3rd Qu.:0.000   3rd Qu.:0.000  
 Max.   :1.000          Max.   :1.000   Max.   :1.000  
 Purpose.Furniture.Equipment Purpose.Radio.Television Purpose.Education
 Min.   :0.000               Min.   :0.00             Min.   :0.00     
 1st Qu.:0.000               1st Qu.:0.00             1st Qu.:0.00     
 Median :0.000               Median :0.00             Median :0.00     
 Mean   :0.181               Mean   :0.28             Mean   :0.05     
 3rd Qu.:0.000               3rd Qu.:1.00             3rd Qu.:0.00     
 Max.   :1.000               Max.   :1.00             Max.   :1.00     
 Purpose.Business SavingsAccountBonds.100.to.500
 Min.   :0.000    Min.   :0.000                 
 1st Qu.:0.000    1st Qu.:0.000                 
 Median :0.000    Median :0.000                 
 Mean   :0.097    Mean   :0.103                 
 3rd Qu.:0.000    3rd Qu.:0.000                 
 Max.   :1.000    Max.   :1.000                 
 SavingsAccountBonds.500.to.1000 SavingsAccountBonds.Unknown
 Min.   :0.000                   Min.   :0.000              
 1st Qu.:0.000                   1st Qu.:0.000              
 Median :0.000                   Median :0.000              
 Mean   :0.063                   Mean   :0.183              
 3rd Qu.:0.000                   3rd Qu.:0.000              
 Max.   :1.000                   Max.   :1.000              
 EmploymentDuration.1.to.4 EmploymentDuration.4.to.7 EmploymentDuration.gt.7
 Min.   :0.000             Min.   :0.000             Min.   :0.000          
 1st Qu.:0.000             1st Qu.:0.000             1st Qu.:0.000          
 Median :0.000             Median :0.000             Median :0.000          
 Mean   :0.339             Mean   :0.174             Mean   :0.253          
 3rd Qu.:1.000             3rd Qu.:0.000             3rd Qu.:1.000          
 Max.   :1.000             Max.   :1.000             Max.   :1.000          
 Personal.Male.Divorced.Seperated Personal.Female.NotSingle
 Min.   :0.00                     Min.   :0.00             
 1st Qu.:0.00                     1st Qu.:0.00             
 Median :0.00                     Median :0.00             
 Mean   :0.05                     Mean   :0.31             
 3rd Qu.:0.00                     3rd Qu.:1.00             
 Max.   :1.00                     Max.   :1.00             
 Personal.Male.Single OtherDebtorsGuarantors.None
 Min.   :0.000        Min.   :0.000              
 1st Qu.:0.000        1st Qu.:1.000              
 Median :1.000        Median :1.000              
 Mean   :0.548        Mean   :0.907              
 3rd Qu.:1.000        3rd Qu.:1.000              
 Max.   :1.000        Max.   :1.000              
 OtherDebtorsGuarantors.Guarantor Property.RealEstate Property.Insurance
 Min.   :0.000                    Min.   :0.000       Min.   :0.000     
 1st Qu.:0.000                    1st Qu.:0.000       1st Qu.:0.000     
 Median :0.000                    Median :0.000       Median :0.000     
 Mean   :0.052                    Mean   :0.282       Mean   :0.232     
 3rd Qu.:0.000                    3rd Qu.:1.000       3rd Qu.:0.000     
 Max.   :1.000                    Max.   :1.000       Max.   :1.000     
 Property.CarOther OtherInstallmentPlans.Bank OtherInstallmentPlans.None
 Min.   :0.000     Min.   :0.000              Min.   :0.000             
 1st Qu.:0.000     1st Qu.:0.000              1st Qu.:1.000             
 Median :0.000     Median :0.000              Median :1.000             
 Mean   :0.332     Mean   :0.139              Mean   :0.814             
 3rd Qu.:1.000     3rd Qu.:0.000              3rd Qu.:1.000             
 Max.   :1.000     Max.   :1.000              Max.   :1.000             
  Housing.Rent    Housing.Own    Job.UnskilledResident Job.SkilledEmployee
 Min.   :0.000   Min.   :0.000   Min.   :0.0           Min.   :0.00       
 1st Qu.:0.000   1st Qu.:0.000   1st Qu.:0.0           1st Qu.:0.00       
 Median :0.000   Median :1.000   Median :0.0           Median :1.00       
 Mean   :0.179   Mean   :0.713   Mean   :0.2           Mean   :0.63       
 3rd Qu.:0.000   3rd Qu.:1.000   3rd Qu.:0.0           3rd Qu.:1.00       
 Max.   :1.000   Max.   :1.000   Max.   :1.0           Max.   :1.00       
 Job.Management.SelfEmp.HighlyQualified
 Min.   :0.000                         
 1st Qu.:0.000                         
 Median :0.000                         
 Mean   :0.148                         
 3rd Qu.:0.000                         
 Max.   :1.000                         
In [85]:
%%R

showChapterScript(12)
NULL
In [70]:
%%R

showChapterOutput(12)
R Information
R version 3.0.1 (2013-05-16) -- "Good Sport"
Copyright (C) 2013 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin10.8.0 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> ################################################################################
> ### R code from Applied Predictive Modeling (2013) by Kuhn and Johnson.
> ### Copyright 2013 Kuhn and Johnson
> ### Web Page: http://www.appliedpredictivemodeling.com
> ### Contact: Max Kuhn (mxkuhn@gmail.com) 
> ###
> ### Chapter 12 Discriminant Analysis and Other Linear Classification Models
> ###
> ### Required packages: AppliedPredictiveModeling, caret, doMC (optional),  
> ###                    glmnet, lattice, MASS, pamr, pls, pROC, sparseLDA
> ###
> ### Data used: The grant application data. See the file 'CreateGrantData.R'
> ###
> ### Notes: 
> ### 1) This code is provided without warranty.
> ###
> ### 2) This code should help the user reproduce the results in the
> ### text. There will be differences between this code and what is in
> ### the computing section. For example, the computing sections show
> ### how the source functions work (e.g. randomForest() or plsr()),
> ### which were not directly used when creating the book. Also, there may be 
> ### syntax differences that occur over time as packages evolve. These files 
> ### will reflect those changes.
> ###
> ### 3) In some cases, the calculations in the book were run in 
> ### parallel. The sub-processes may reset the random number seed.
> ### Your results may vary slightly.
> ###
> ################################################################################
> 
> ################################################################################
> ### Section 12.1 Case Study: Predicting Successful Grant Applications
> 
> load("grantData.RData")
> 
> library(caret)
Loading required package: lattice
Loading required package: ggplot2
> library(doMC)
Loading required package: foreach
Loading required package: iterators
Loading required package: parallel
> registerDoMC(12)
> library(plyr)
> library(reshape2)
> 
> ## Look at two different ways to split and resample the data. A support vector
> ## machine is used to illustrate the differences. The full set of predictors
> ## is used. 
> 
> pre2008Data <- training[pre2008,]
> year2008Data <- rbind(training[-pre2008,], testing)
> 
> set.seed(552)
> test2008 <- createDataPartition(year2008Data$Class, p = .25)[[1]]
> 
> allData <- rbind(pre2008Data, year2008Data[-test2008,])
> holdout2008 <- year2008Data[test2008,]
> 
> ## Use a common tuning grid for both approaches. 
> svmrGrid <- expand.grid(sigma = c(.00007, .00009, .0001, .0002),
+                         C = 2^(-3:8))
> 
> ## Evaluate the model using overall 10-fold cross-validation
> ctrl0 <- trainControl(method = "cv",
+                       summaryFunction = twoClassSummary,
+                       classProbs = TRUE)
> set.seed(477)
> svmFit0 <- train(pre2008Data[,fullSet], pre2008Data$Class,
+                  method = "svmRadial",
+                  tuneGrid = svmrGrid,
+                  preProc = c("center", "scale"),
+                  metric = "ROC",
+                  trControl = ctrl0)
Loading required package: kernlab
Loading required package: pROC
Type 'citation("pROC")' for a citation.

Attaching package: 'pROC'

The following objects are masked from 'package:stats':

    cov, smooth, var

> svmFit0
Support Vector Machines with Radial Basis Function Kernel 

6633 samples
1070 predictors
   2 classes: 'successful', 'unsuccessful' 

Pre-processing: centered, scaled 
Resampling: Cross-Validated (10 fold) 

Summary of sample sizes: 5970, 5970, 5969, 5970, 5970, 5969, ... 

Resampling results across tuning parameters:

  sigma  C      ROC    Sens   Spec   ROC SD  Sens SD  Spec SD
  7e-05  0.125  0.806  0.88   0.562  0.0231  0.023    0.0168 
  7e-05  0.25   0.81   0.876  0.574  0.022   0.0254   0.0157 
  7e-05  0.5    0.836  0.837  0.677  0.018   0.029    0.0194 
  7e-05  1      0.853  0.803  0.757  0.0173  0.0308   0.0288 
  7e-05  2      0.863  0.805  0.78   0.0177  0.0275   0.0318 
  7e-05  4      0.869  0.8    0.789  0.0168  0.0279   0.0285 
  7e-05  8      0.874  0.798  0.798  0.0189  0.0313   0.0279 
  7e-05  16     0.876  0.796  0.797  0.0193  0.03     0.0235 
  7e-05  32     0.877  0.793  0.801  0.0184  0.0242   0.0287 
  7e-05  64     0.877  0.793  0.81   0.0178  0.034    0.0182 
  7e-05  128    0.876  0.793  0.812  0.0163  0.0233   0.0164 
  7e-05  256    0.873  0.794  0.812  0.0165  0.0239   0.0162 
  9e-05  0.125  0.8    0.876  0.551  0.0249  0.0209   0.023  
  9e-05  0.25   0.811  0.87   0.581  0.0219  0.0236   0.0186 
  9e-05  0.5    0.842  0.816  0.715  0.018   0.031    0.0258 
  9e-05  1      0.856  0.8    0.769  0.0176  0.0314   0.0306 
  9e-05  2      0.866  0.801  0.785  0.0173  0.0277   0.0315 
  9e-05  4      0.871  0.8    0.792  0.0172  0.0271   0.0269 
  9e-05  8      0.875  0.796  0.796  0.0188  0.0295   0.0259 
  9e-05  16     0.877  0.795  0.8    0.0186  0.0258   0.0246 
  9e-05  32     0.878  0.793  0.804  0.0179  0.0291   0.025  
  9e-05  64     0.877  0.794  0.813  0.0169  0.0297   0.0187 
  9e-05  128    0.876  0.795  0.813  0.0156  0.0228   0.0153 
  9e-05  256    0.874  0.788  0.814  0.0164  0.0205   0.017  
  1e-04  0.125  0.797  0.878  0.546  0.0257  0.0241   0.016  
  1e-04  0.25   0.814  0.863  0.596  0.0212  0.0319   0.0189 
  1e-04  0.5    0.845  0.81   0.728  0.018   0.0296   0.0247 
  1e-04  1      0.857  0.799  0.771  0.0179  0.0321   0.0298 
  1e-04  2      0.867  0.804  0.785  0.0173  0.0285   0.0312 
  1e-04  4      0.872  0.801  0.794  0.0174  0.0279   0.0266 
  1e-04  8      0.875  0.792  0.797  0.0187  0.0304   0.0242 
  1e-04  16     0.878  0.794  0.799  0.0184  0.0249   0.025  
  1e-04  32     0.878  0.795  0.806  0.0179  0.0335   0.0222 
  1e-04  64     0.878  0.796  0.812  0.0163  0.0245   0.0168 
  1e-04  128    0.876  0.796  0.811  0.0159  0.0215   0.0143 
  1e-04  256    0.874  0.788  0.816  0.0165  0.0209   0.0127 
  2e-04  0.125  0.786  0.861  0.542  0.0282  0.0356   0.0198 
  2e-04  0.25   0.836  0.81   0.701  0.0192  0.0382   0.0232 
  2e-04  0.5    0.853  0.792  0.765  0.0177  0.0342   0.0308 
  2e-04  1      0.864  0.8    0.782  0.0177  0.028    0.036  
  2e-04  2      0.87   0.796  0.789  0.0174  0.0258   0.0277 
  2e-04  4      0.875  0.795  0.793  0.0182  0.0295   0.026  
  2e-04  8      0.878  0.793  0.801  0.0176  0.0293   0.0196 
  2e-04  16     0.879  0.796  0.809  0.0167  0.033    0.0203 
  2e-04  32     0.88   0.795  0.811  0.0153  0.0227   0.0169 
  2e-04  64     0.879  0.792  0.813  0.0155  0.0194   0.0171 
  2e-04  128    0.877  0.786  0.816  0.0162  0.0235   0.0128 
  2e-04  256    0.877  0.789  0.822  0.0156  0.0241   0.0159 

ROC was used to select the optimal model using the largest value.
The final values used for the model were sigma = 2e-04 and C = 32. 
> 
> ### Now fit the single 2008 test set
> ctrl00 <- trainControl(method = "LGOCV",
+                        summaryFunction = twoClassSummary,
+                        classProbs = TRUE,
+                        index = list(TestSet = 1:nrow(pre2008Data)))
> 
> 
> set.seed(476)
> svmFit00 <- train(allData[,fullSet], allData$Class,
+                   method = "svmRadial",
+                   tuneGrid = svmrGrid,
+                   preProc = c("center", "scale"),
+                   metric = "ROC",
+                   trControl = ctrl00)
> svmFit00
Support Vector Machines with Radial Basis Function Kernel 

8189 samples
1070 predictors
   2 classes: 'successful', 'unsuccessful' 

Pre-processing: centered, scaled 
Resampling: Repeated Train/Test Splits Estimated (1 reps, 0.75%) 

Summary of sample sizes: 6633 

Resampling results across tuning parameters:

  sigma  C      ROC    Sens   Spec 
  7e-05  0.125  0.806  0.968  0.494
  7e-05  0.25   0.814  0.965  0.512
  7e-05  0.5    0.855  0.921  0.651
  7e-05  1      0.873  0.882  0.753
  7e-05  2      0.882  0.873  0.783
  7e-05  4      0.886  0.856  0.802
  7e-05  8      0.887  0.835  0.813
  7e-05  16     0.883  0.812  0.813
  7e-05  32     0.875  0.786  0.814
  7e-05  64     0.872  0.794  0.816
  7e-05  128    0.872  0.791  0.807
  7e-05  256    0.869  0.793  0.811
  9e-05  0.125  0.798  0.97   0.478
  9e-05  0.25   0.819  0.96   0.536
  9e-05  0.5    0.864  0.902  0.688
  9e-05  1      0.876  0.868  0.765
  9e-05  2      0.885  0.863  0.785
  9e-05  4      0.888  0.84   0.807
  9e-05  8      0.887  0.822  0.806
  9e-05  16     0.88   0.801  0.816
  9e-05  32     0.874  0.791  0.821
  9e-05  64     0.873  0.8    0.811
  9e-05  128    0.872  0.791  0.812
  9e-05  256    0.865  0.775  0.803
  1e-04  0.125  0.795  0.961  0.476
  1e-04  0.25   0.825  0.946  0.563
  1e-04  0.5    0.867  0.895  0.709
  1e-04  1      0.877  0.87   0.765
  1e-04  2      0.885  0.858  0.786
  1e-04  4      0.888  0.835  0.804
  1e-04  8      0.887  0.819  0.809
  1e-04  16     0.88   0.791  0.818
  1e-04  32     0.875  0.798  0.814
  1e-04  64     0.874  0.796  0.809
  1e-04  128    0.871  0.794  0.807
  1e-04  256    0.863  0.78   0.79 
  2e-04  0.125  0.791  0.942  0.504
  2e-04  0.25   0.86   0.888  0.684
  2e-04  0.5    0.875  0.865  0.752
  2e-04  1      0.884  0.849  0.783
  2e-04  2      0.886  0.833  0.798
  2e-04  4      0.888  0.821  0.803
  2e-04  8      0.883  0.805  0.814
  2e-04  16     0.88   0.803  0.817
  2e-04  32     0.877  0.796  0.816
  2e-04  64     0.868  0.78   0.805
  2e-04  128    0.862  0.779  0.791
  2e-04  256    0.857  0.78   0.779

ROC was used to select the optimal model using the largest value.
The final values used for the model were sigma = 2e-04 and C = 4. 
> 
> ## Combine the two sets of results and plot
> 
> grid0 <- subset(svmFit0$results,  sigma == svmFit0$bestTune$sigma)
> grid0$Model <- "10-Fold Cross-Validation"
> 
> grid00 <- subset(svmFit00$results,  sigma == svmFit00$bestTune$sigma)
> grid00$Model <- "Single 2008 Test Set"
> 
> plotData <- rbind(grid00, grid0)
> 
> plotData <- plotData[!is.na(plotData$ROC),]
> xyplot(ROC ~ C, data = plotData,
+        groups = Model,
+        type = c("g", "o"),
+        scales = list(x = list(log = 2)),
+        auto.key = list(columns = 1))
> 
> ################################################################################
> ### Section 12.2 Logistic Regression
> 
> modelFit <- glm(Class ~ Day, data = training[pre2008,], family = binomial)
> dataGrid <- data.frame(Day = seq(0, 365, length = 500))
> dataGrid$Linear <- 1 - predict(modelFit, dataGrid, type = "response")
> linear2008 <- auc(roc(response = training[-pre2008, "Class"],
+                       predictor = 1 - predict(modelFit, 
+                                               training[-pre2008,], 
+                                               type = "response"),
+                       levels = rev(levels(training[-pre2008, "Class"]))))
> 
> 
> modelFit2 <- glm(Class ~ Day + I(Day^2), 
+                  data = training[pre2008,], 
+                  family = binomial)
> dataGrid$Quadratic <- 1 - predict(modelFit2, dataGrid, type = "response")
> quad2008 <- auc(roc(response = training[-pre2008, "Class"],
+                     predictor = 1 - predict(modelFit2, 
+                                             training[-pre2008,], 
+                                             type = "response"),
+                     levels = rev(levels(training[-pre2008, "Class"]))))
> 
> dataGrid <- melt(dataGrid, id.vars = "Day")
> 
> byDay <- training[pre2008, c("Day", "Class")]
> byDay$Binned <- cut(byDay$Day, seq(0, 360, by = 5))
> 
> observedProps <- ddply(byDay, .(Binned),
+                        function(x) c(n = nrow(x), mean = mean(x$Class == "successful")))
> observedProps$midpoint <- seq(2.5, 357.5, by = 5)
> 
> xyplot(value ~ Day|variable, data = dataGrid,
+        ylab = "Probability of A Successful Grant",
+        ylim = extendrange(0:1),
+        between = list(x = 1),
+        panel = function(...)
+        {
+          panel.xyplot(x = observedProps$midpoint, observedProps$mean,
+                       pch = 16, col = rgb(.2, .2, .2, .5))
+          panel.xyplot(..., type = "l", col = "black", lwd = 2)
+        })
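The `cut`/`ddply` pair above bins `Day` into 5-day windows and computes the observed success rate per bin, which the panel function then overlays on the fitted probability curves. For readers following along on the notebook's Python side, the same binning can be sketched with pandas (the data below is invented, not the grant data):

```python
import numpy as np
import pandas as pd

# Invented stand-in data: grant day of year and a success indicator
# whose probability declines over the year.
rng = np.random.default_rng(476)
day = rng.integers(0, 360, size=1000)
success = rng.random(1000) < (0.8 - day / 600.0)

df = pd.DataFrame({"Day": day, "successful": success})
# 5-day bins, as in cut(byDay$Day, seq(0, 360, by = 5))
df["Binned"] = pd.cut(df["Day"], np.arange(0, 365, 5))

# Per-bin count and observed success proportion, as in the ddply call.
props = (df.groupby("Binned", observed=True)["successful"]
           .agg(n="size", mean="mean")
           .reset_index())
print(props.head())
```

As in R's `cut`, the default intervals are open on the left, so a `Day` of exactly 0 falls outside the first bin.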
> 
> ## For the reduced set of factors, fit the logistic regression model (linear and
> ## quadratic) and evaluate on the 2008 hold-out set
> training$Day2 <- training$Day^2
> testing$Day2 <- testing$Day^2
> fullSet <- c(fullSet, "Day2")
> reducedSet <- c(reducedSet, "Day2")
> 
> ## This control object will be used across multiple models so that the
> ## data splitting is consistent
> 
> ctrl <- trainControl(method = "LGOCV",
+                      summaryFunction = twoClassSummary,
+                      classProbs = TRUE,
+                      index = list(TrainSet = pre2008),
+                      savePredictions = TRUE)
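The `index = list(TrainSet = pre2008)` argument pins caret's "LGOCV" resampling to a single, predefined pre-2008/2008 split that every subsequent `train` call reuses, so the models' hold-out ROC values are directly comparable. A rough sketch of the same idea in scikit-learn uses `PredefinedSplit` (the toy data below is invented, not the grant data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, PredefinedSplit

# Invented stand-in data: rows 0-79 play the role of the pre-2008
# grants, rows 80-99 the 2008 hold-out.
rng = np.random.default_rng(476)
X = rng.normal(size=(100, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=100) > 0).astype(int)

# -1 marks rows that are always in training; 0 marks the single
# hold-out fold, mirroring index = list(TrainSet = pre2008).
test_fold = np.r_[np.full(80, -1), np.zeros(20)]
split = PredefinedSplit(test_fold)

# Every model tuned with cv=split is scored on the identical split.
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      {"C": [0.1, 1.0, 10.0]},
                      scoring="roc_auc", cv=split)
search.fit(X, y)
print(split.get_n_splits())  # 1 resample, like caret's "1 reps"
```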
> 
> set.seed(476)
> lrFit <- train(x = training[,reducedSet], 
+                y = training$Class,
+                method = "glm",
+                metric = "ROC",
+                trControl = ctrl)
> lrFit
Generalized Linear Model 

8190 samples
 253 predictors
   2 classes: 'successful', 'unsuccessful' 

No pre-processing
Resampling: Repeated Train/Test Splits Estimated (1 reps, 0.75%) 

Summary of sample sizes: 6633 

Resampling results

  ROC    Sens   Spec 
  0.872  0.804  0.822

 
> set.seed(476)
> lrFit2 <- train(x = training[,c(fullSet, "Day2")], 
+                 y = training$Class,
+                 method = "glm",
+                 metric = "ROC",
+                 trControl = ctrl)
Warning messages:
1: glm.fit: algorithm did not converge 
2: glm.fit: fitted probabilities numerically 0 or 1 occurred 
3: In predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type ==  :
  prediction from a rank-deficient fit may be misleading
4: In predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type ==  :
  prediction from a rank-deficient fit may be misleading
5: glm.fit: fitted probabilities numerically 0 or 1 occurred 
> lrFit2
Generalized Linear Model 

8190 samples
1072 predictors
   2 classes: 'successful', 'unsuccessful' 

No pre-processing
Resampling: Repeated Train/Test Splits Estimated (1 reps, 0.75%) 

Summary of sample sizes: 6633 

Resampling results

  ROC    Sens  Spec 
  0.782  0.77  0.761

 
> 
> lrFit$pred <- merge(lrFit$pred,  lrFit$bestTune)
> 
> ## Get the confusion matrices for the hold-out set
> lrCM <- confusionMatrix(lrFit, norm = "none")
> lrCM
Repeated Train/Test Splits Estimated (1 reps, 0.75%) Confusion Matrix 

(entries are un-normalized counts)
 
Loading required package: class
Confusion Matrix and Statistics

              Reference
Prediction     successful unsuccessful
  successful          458          176
  unsuccessful        112          811
                                         
               Accuracy : 0.815          
                 95% CI : (0.7948, 0.834)
    No Information Rate : 0.6339         
    P-Value [Acc > NIR] : < 2.2e-16      
                                         
                  Kappa : 0.6107         
 Mcnemar's Test P-Value : 0.0002054      
                                         
            Sensitivity : 0.8035         
            Specificity : 0.8217         
         Pos Pred Value : 0.7224         
         Neg Pred Value : 0.8787         
             Prevalence : 0.3661         
         Detection Rate : 0.2942         
   Detection Prevalence : 0.4072         
      Balanced Accuracy : 0.8126         
                                         
       'Positive' Class : successful     
                                         

> lrCM2 <- confusionMatrix(lrFit2, norm = "none")
> lrCM2
Repeated Train/Test Splits Estimated (1 reps, 0.75%) Confusion Matrix 

(entries are un-normalized counts)
 
Confusion Matrix and Statistics

              Reference
Prediction     successful unsuccessful
  successful          439          236
  unsuccessful        131          751
                                          
               Accuracy : 0.7643          
                 95% CI : (0.7424, 0.7852)
    No Information Rate : 0.6339          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.5112          
 Mcnemar's Test P-Value : 5.675e-08       
                                          
            Sensitivity : 0.7702          
            Specificity : 0.7609          
         Pos Pred Value : 0.6504          
         Neg Pred Value : 0.8515          
             Prevalence : 0.3661          
         Detection Rate : 0.2820          
   Detection Prevalence : 0.4335          
      Balanced Accuracy : 0.7655          
                                          
       'Positive' Class : successful      
                                          

> 
> ## Get the area under the ROC curve for the hold-out set
> lrRoc <- roc(response = lrFit$pred$obs,
+              predictor = lrFit$pred$successful,
+              levels = rev(levels(lrFit$pred$obs)))
> lrRoc2 <- roc(response = lrFit2$pred$obs,
+               predictor = lrFit2$pred$successful,
+               levels = rev(levels(lrFit2$pred$obs)))
> lrImp <- varImp(lrFit, scale = FALSE)
> 
> plot(lrRoc, legacy.axes = TRUE)

Call:
roc.default(response = lrFit$pred$obs, predictor = lrFit$pred$successful,     levels = rev(levels(lrFit$pred$obs)))

Data: lrFit$pred$successful in 987 controls (lrFit$pred$obs unsuccessful) < 570 cases (lrFit$pred$obs successful).
Area under the curve: 0.8715
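pROC's `levels = rev(...)` argument controls which factor level is treated as the positive ("case") class when the AUC is computed. The equivalent bookkeeping in scikit-learn is done by coding `y_true` so that 1 is the class whose probability is supplied; a minimal sketch with invented hold-out predictions:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Invented hold-out predictions: prob_successful is the predicted
# probability of the "successful" class for eight grants.
obs = np.array(["successful", "unsuccessful", "successful", "unsuccessful",
                "successful", "unsuccessful", "unsuccessful", "successful"])
prob_successful = np.array([0.9, 0.2, 0.7, 0.4, 0.5, 0.1, 0.55, 0.8])

# Code y_true so that 1 marks the class whose probability is passed.
auc = roc_auc_score((obs == "successful").astype(int), prob_successful)
print(round(auc, 3))  # → 0.938
```

If the probability of the *other* class were passed by mistake, the AUC would come out as one minus the true value, which is why the `1 - predict(...)` and `levels = rev(...)` pairing in the R code above matters.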
> 
> ################################################################################
> ### Section 12.3 Linear Discriminant Analysis
> 
> ## Fit the model to the reduced set
> set.seed(476)
> ldaFit <- train(x = training[,reducedSet], 
+                 y = training$Class,
+                 method = "lda",
+                 preProc = c("center","scale"),
+                 metric = "ROC",
+                 trControl = ctrl)
Loading required package: MASS
> ldaFit
Linear Discriminant Analysis 

8190 samples
 253 predictors
   2 classes: 'successful', 'unsuccessful' 

Pre-processing: centered, scaled 
Resampling: Repeated Train/Test Splits Estimated (1 reps, 0.75%) 

Summary of sample sizes: 6633 

Resampling results

  ROC    Sens   Spec 
  0.889  0.804  0.823

 
> 
> ldaFit$pred <- merge(ldaFit$pred,  ldaFit$bestTune)
> ldaCM <- confusionMatrix(ldaFit, norm = "none")
> ldaCM
Repeated Train/Test Splits Estimated (1 reps, 0.75%) Confusion Matrix 

(entries are un-normalized counts)
 
Confusion Matrix and Statistics

              Reference
Prediction     successful unsuccessful
  successful          458          175
  unsuccessful        112          812
                                          
               Accuracy : 0.8157          
                 95% CI : (0.7955, 0.8346)
    No Information Rate : 0.6339          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.6119          
 Mcnemar's Test P-Value : 0.0002525       
                                          
            Sensitivity : 0.8035          
            Specificity : 0.8227          
         Pos Pred Value : 0.7235          
         Neg Pred Value : 0.8788          
             Prevalence : 0.3661          
         Detection Rate : 0.2942          
   Detection Prevalence : 0.4066          
      Balanced Accuracy : 0.8131          
                                          
       'Positive' Class : successful      
                                          

> ldaRoc <- roc(response = ldaFit$pred$obs,
+               predictor = ldaFit$pred$successful,
+               levels = rev(levels(ldaFit$pred$obs)))
> plot(lrRoc, type = "s", col = rgb(.2, .2, .2, .2), legacy.axes = TRUE)

Call:
roc.default(response = lrFit$pred$obs, predictor = lrFit$pred$successful,     levels = rev(levels(lrFit$pred$obs)))

Data: lrFit$pred$successful in 987 controls (lrFit$pred$obs unsuccessful) < 570 cases (lrFit$pred$obs successful).
Area under the curve: 0.8715
> plot(ldaRoc, add = TRUE, type = "s", legacy.axes = TRUE)

Call:
roc.default(response = ldaFit$pred$obs, predictor = ldaFit$pred$successful,     levels = rev(levels(ldaFit$pred$obs)))

Data: ldaFit$pred$successful in 987 controls (ldaFit$pred$obs unsuccessful) < 570 cases (ldaFit$pred$obs successful).
Area under the curve: 0.8892
> 
> ################################################################################
> ### Section 12.4 Partial Least Squares Discriminant Analysis
> 
> ## This model uses all of the predictors
> set.seed(476)
> plsFit <- train(x = training[,fullSet], 
+                 y = training$Class,
+                 method = "pls",
+                 tuneGrid = expand.grid(ncomp = 1:10),
+                 preProc = c("center","scale"),
+                 metric = "ROC",
+                 probMethod = "Bayes",
+                 trControl = ctrl)
Loading required package: pls

Attaching package: 'pls'

The following object is masked from 'package:caret':

    R2

The following object is masked from 'package:stats':

    loadings

> plsFit
Partial Least Squares 

8190 samples
1071 predictors
   2 classes: 'successful', 'unsuccessful' 

Pre-processing: centered, scaled 
Resampling: Repeated Train/Test Splits Estimated (1 reps, 0.75%) 

Summary of sample sizes: 6633 

Resampling results across tuning parameters:

  ncomp  ROC    Sens   Spec 
  1      0.821  0.863  0.667
  2      0.847  0.83   0.749
  3      0.863  0.851  0.749
  4      0.863  0.835  0.754
  5      0.864  0.839  0.77 
  6      0.87   0.837  0.77 
  7      0.865  0.816  0.776
  8      0.862  0.816  0.779
  9      0.864  0.825  0.778
  10     0.858  0.812  0.782

ROC was used to select the optimal model using the largest value.
The final value used for the model was ncomp = 6. 
> 
> plsImpGrant <- varImp(plsFit, scale = FALSE)
> 
> bestPlsNcomp <- plsFit$results[best(plsFit$results, "ROC", maximize = TRUE), "ncomp"]
> bestPlsROC <- plsFit$results[best(plsFit$results, "ROC", maximize = TRUE), "ROC"]
> 
> ## Only keep the final tuning parameter data
> plsFit$pred <- merge(plsFit$pred,  plsFit$bestTune)
> 
> plsRoc <- roc(response = plsFit$pred$obs,
+               predictor = plsFit$pred$successful,
+               levels = rev(levels(plsFit$pred$obs)))
> 
> ### PLS confusion matrix information
> plsCM <- confusionMatrix(plsFit, norm = "none")
> plsCM
Repeated Train/Test Splits Estimated (1 reps, 0.75%) Confusion Matrix 

(entries are un-normalized counts)
 
Confusion Matrix and Statistics

              Reference
Prediction     successful unsuccessful
  successful          477          227
  unsuccessful         93          760
                                          
               Accuracy : 0.7945          
                 95% CI : (0.7735, 0.8143)
    No Information Rate : 0.6339          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.5781          
 Mcnemar's Test P-Value : 1.046e-13       
                                          
            Sensitivity : 0.8368          
            Specificity : 0.7700          
         Pos Pred Value : 0.6776          
         Neg Pred Value : 0.8910          
             Prevalence : 0.3661          
         Detection Rate : 0.3064          
   Detection Prevalence : 0.4522          
      Balanced Accuracy : 0.8034          
                                          
       'Positive' Class : successful      
                                          

> 
> ## Now fit a model that uses a smaller set of predictors chosen by unsupervised 
> ## filtering. 
> 
> set.seed(476)
> plsFit2 <- train(x = training[,reducedSet], 
+                  y = training$Class,
+                  method = "pls",
+                  tuneGrid = expand.grid(ncomp = 1:10),
+                  preProc = c("center","scale"),
+                  metric = "ROC",
+                  probMethod = "Bayes",
+                  trControl = ctrl)
> plsFit2
Partial Least Squares 

8190 samples
 253 predictors
   2 classes: 'successful', 'unsuccessful' 

Pre-processing: centered, scaled 
Resampling: Repeated Train/Test Splits Estimated (1 reps, 0.75%) 

Summary of sample sizes: 6633 

Resampling results across tuning parameters:

  ncomp  ROC    Sens   Spec 
  1      0.836  0.912  0.616
  2      0.868  0.858  0.752
  3      0.889  0.874  0.762
  4      0.895  0.86   0.777
  5      0.895  0.846  0.79 
  6      0.894  0.832  0.795
  7      0.89   0.823  0.806
  8      0.888  0.83   0.803
  9      0.887  0.83   0.803
  10     0.884  0.821  0.807

ROC was used to select the optimal model using the largest value.
The final value used for the model was ncomp = 4. 
> 
> bestPlsNcomp2 <- plsFit2$results[best(plsFit2$results, "ROC", maximize = TRUE), "ncomp"]
> bestPlsROC2 <- plsFit2$results[best(plsFit2$results, "ROC", maximize = TRUE), "ROC"]
> 
> plsFit2$pred <- merge(plsFit2$pred,  plsFit2$bestTune)
> 
> plsRoc2 <- roc(response = plsFit2$pred$obs,
+                predictor = plsFit2$pred$successful,
+                levels = rev(levels(plsFit2$pred$obs)))
> plsCM2 <- confusionMatrix(plsFit2, norm = "none")
> plsCM2
Repeated Train/Test Splits Estimated (1 reps, 0.75%) Confusion Matrix 

(entries are un-normalized counts)
 
Confusion Matrix and Statistics

              Reference
Prediction     successful unsuccessful
  successful          490          220
  unsuccessful         80          767
                                          
               Accuracy : 0.8073          
                 95% CI : (0.7868, 0.8266)
    No Information Rate : 0.6339          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.6053          
 Mcnemar's Test P-Value : 1.014e-15       
                                          
            Sensitivity : 0.8596          
            Specificity : 0.7771          
         Pos Pred Value : 0.6901          
         Neg Pred Value : 0.9055          
             Prevalence : 0.3661          
         Detection Rate : 0.3147          
   Detection Prevalence : 0.4560          
      Balanced Accuracy : 0.8184          
                                          
       'Positive' Class : successful      
                                          

> 
> pls.ROC <- cbind(plsFit$results,Descriptors="Full Set")
> pls2.ROC <- cbind(plsFit2$results,Descriptors="Reduced Set")
> 
> plsCompareROC <- data.frame(rbind(pls.ROC,pls2.ROC))
> 
> xyplot(ROC ~ ncomp,
+        data = plsCompareROC,
+        xlab = "# Components",
+        ylab = "ROC (2008 Hold-Out Data)",
+        auto.key = list(columns = 2),
+        groups = Descriptors,
+        type = c("o", "g"))
> 
> ## Plot ROC curves and variable importance scores
> plot(ldaRoc, type = "s", col = rgb(.2, .2, .2, .2), legacy.axes = TRUE)

Call:
roc.default(response = ldaFit$pred$obs, predictor = ldaFit$pred$successful,     levels = rev(levels(ldaFit$pred$obs)))

Data: ldaFit$pred$successful in 987 controls (ldaFit$pred$obs unsuccessful) < 570 cases (ldaFit$pred$obs successful).
Area under the curve: 0.8892
> plot(lrRoc, type = "s", col = rgb(.2, .2, .2, .2), add = TRUE, legacy.axes = TRUE)

Call:
roc.default(response = lrFit$pred$obs, predictor = lrFit$pred$successful,     levels = rev(levels(lrFit$pred$obs)))

Data: lrFit$pred$successful in 987 controls (lrFit$pred$obs unsuccessful) < 570 cases (lrFit$pred$obs successful).
Area under the curve: 0.8715
> plot(plsRoc2, type = "s", add = TRUE, legacy.axes = TRUE)

Call:
roc.default(response = plsFit2$pred$obs, predictor = plsFit2$pred$successful,     levels = rev(levels(plsFit2$pred$obs)))

Data: plsFit2$pred$successful in 987 controls (plsFit2$pred$obs unsuccessful) < 570 cases (plsFit2$pred$obs successful).
Area under the curve: 0.895
> 
> plot(plsImpGrant, top=20, scales = list(y = list(cex = .95)))
> 
> ################################################################################
> ### Section 12.5 Penalized Models
> 
> ## The glmnet model
> glmnGrid <- expand.grid(alpha = c(0,  .1,  .2, .4, .6, .8, 1),
+                         lambda = seq(.01, .2, length = 40))
> set.seed(476)
> glmnFit <- train(x = training[,fullSet], 
+                  y = training$Class,
+                  method = "glmnet",
+                  tuneGrid = glmnGrid,
+                  preProc = c("center", "scale"),
+                  metric = "ROC",
+                  trControl = ctrl)
Loading required package: glmnet
Loading required package: Matrix
Loaded glmnet 1.9-3


Attaching package: 'glmnet'

The following object is masked from 'package:pROC':

    auc

> glmnFit
glmnet 

8190 samples
1071 predictors
   2 classes: 'successful', 'unsuccessful' 

Pre-processing: centered, scaled 
Resampling: Repeated Train/Test Splits Estimated (1 reps, 0.75%) 

Summary of sample sizes: 6633 

Resampling results across tuning parameters:

  alpha  lambda  ROC    Sens   Spec 
  0      0.01    0.856  0.8    0.791
  0      0.0149  0.856  0.8    0.791
  0      0.0197  0.856  0.8    0.791
  0      0.0246  0.858  0.804  0.796
  0      0.0295  0.86   0.804  0.801
  0      0.0344  0.861  0.802  0.8  
  0      0.0392  0.862  0.804  0.801
  0      0.0441  0.863  0.804  0.801
  0      0.049   0.863  0.798  0.8  
  0      0.0538  0.864  0.807  0.799
  0      0.0587  0.866  0.809  0.796
  0      0.0636  0.864  0.807  0.797
  0      0.0685  0.866  0.809  0.797
  0      0.0733  0.865  0.807  0.797
  0      0.0782  0.866  0.811  0.794
  0      0.0831  0.867  0.814  0.792
  0      0.0879  0.866  0.814  0.792
  0      0.0928  0.866  0.816  0.792
  0      0.0977  0.867  0.816  0.793
  0      0.103   0.867  0.819  0.791
  0      0.107   0.867  0.818  0.79 
  0      0.112   0.867  0.819  0.79 
  0      0.117   0.866  0.819  0.789
  0      0.122   0.866  0.816  0.791
  0      0.127   0.867  0.821  0.794
  0      0.132   0.866  0.819  0.791
  0      0.137   0.866  0.823  0.789
  0      0.142   0.867  0.823  0.789
  0      0.146   0.866  0.818  0.792
  0      0.151   0.865  0.818  0.792
  0      0.156   0.866  0.821  0.789
  0      0.161   0.866  0.819  0.79 
  0      0.166   0.866  0.825  0.788
  0      0.171   0.867  0.825  0.788
  0      0.176   0.866  0.821  0.792
  0      0.181   0.865  0.823  0.791
  0      0.185   0.866  0.826  0.788
  0      0.19    0.865  0.825  0.788
  0      0.195   0.866  0.821  0.791
  0      0.2     0.865  0.823  0.789
  0.1    0.01    0.866  0.809  0.797
  0.1    0.0149  0.874  0.823  0.798
  0.1    0.0197  0.881  0.826  0.797
  0.1    0.0246  0.886  0.828  0.803
  0.1    0.0295  0.89   0.835  0.803
  0.1    0.0344  0.892  0.839  0.81 
  0.1    0.0392  0.895  0.84   0.811
  0.1    0.0441  0.898  0.851  0.809
  0.1    0.049   0.9    0.851  0.809
  0.1    0.0538  0.9    0.853  0.814
  0.1    0.0587  0.902  0.858  0.805
  0.1    0.0636  0.903  0.86   0.809
  0.1    0.0685  0.904  0.868  0.803
  0.1    0.0733  0.906  0.874  0.799
  0.1    0.0782  0.906  0.87   0.802
  0.1    0.0831  0.906  0.872  0.801
  0.1    0.0879  0.907  0.877  0.8  
  0.1    0.0928  0.908  0.877  0.794
  0.1    0.0977  0.908  0.879  0.795
  0.1    0.103   0.907  0.877  0.795
  0.1    0.107   0.907  0.881  0.792
  0.1    0.112   0.907  0.881  0.797
  0.1    0.117   0.907  0.884  0.795
  0.1    0.122   0.908  0.886  0.791
  0.1    0.127   0.907  0.882  0.791
  0.1    0.132   0.909  0.884  0.79 
  0.1    0.137   0.908  0.886  0.789
  0.1    0.142   0.908  0.884  0.786
  0.1    0.146   0.909  0.886  0.784
  0.1    0.151   0.908  0.881  0.787
  0.1    0.156   0.908  0.881  0.785
  0.1    0.161   0.909  0.884  0.787
  0.1    0.166   0.908  0.882  0.787
  0.1    0.171   0.91   0.889  0.785
  0.1    0.176   0.91   0.889  0.784
  0.1    0.181   0.91   0.889  0.785
  0.1    0.185   0.909  0.886  0.788
  0.1    0.19    0.91   0.893  0.778
  0.1    0.195   0.909  0.889  0.784
  0.1    0.2     0.909  0.891  0.781
  0.2    0.01    0.878  0.83   0.8  
  0.2    0.0149  0.887  0.835  0.803
  0.2    0.0197  0.891  0.839  0.804
  0.2    0.0246  0.896  0.849  0.802
  0.2    0.0295  0.899  0.853  0.805
  0.2    0.0344  0.902  0.858  0.8  
  0.2    0.0392  0.902  0.86   0.798
  0.2    0.0441  0.903  0.874  0.794
  0.2    0.049   0.904  0.879  0.802
  0.2    0.0538  0.904  0.879  0.794
  0.2    0.0587  0.905  0.881  0.797
  0.2    0.0636  0.904  0.881  0.8  
  0.2    0.0685  0.905  0.888  0.793
  0.2    0.0733  0.907  0.888  0.793
  0.2    0.0782  0.905  0.886  0.792
  0.2    0.0831  0.906  0.884  0.793
  0.2    0.0879  0.907  0.886  0.788
  0.2    0.0928  0.905  0.882  0.789
  0.2    0.0977  0.906  0.881  0.791
  0.2    0.103   0.906  0.888  0.777
  0.2    0.107   0.907  0.889  0.778
  0.2    0.112   0.906  0.884  0.774
  0.2    0.117   0.905  0.882  0.777
  0.2    0.122   0.905  0.881  0.779
  0.2    0.127   0.905  0.879  0.778
  0.2    0.132   0.905  0.884  0.772
  0.2    0.137   0.905  0.884  0.77 
  0.2    0.142   0.904  0.877  0.779
  0.2    0.146   0.904  0.879  0.773
  0.2    0.151   0.905  0.884  0.77 
  0.2    0.156   0.904  0.879  0.778
  0.2    0.161   0.905  0.886  0.768
  0.2    0.166   0.905  0.898  0.761
  0.2    0.171   0.904  0.891  0.766
  0.2    0.176   0.904  0.884  0.775
  0.2    0.181   0.903  0.875  0.772
  0.2    0.185   0.905  0.898  0.759
  0.2    0.19    0.904  0.886  0.764
  0.2    0.195   0.903  0.879  0.772
  0.2    0.2     0.903  0.888  0.765
  0.4    0.01    0.887  0.84   0.798
  0.4    0.0149  0.893  0.853  0.796
  0.4    0.0197  0.896  0.858  0.795
  0.4    0.0246  0.897  0.863  0.796
  0.4    0.0295  0.897  0.87   0.793
  0.4    0.0344  0.897  0.875  0.786
  0.4    0.0392  0.897  0.868  0.799
  0.4    0.0441  0.898  0.875  0.793
  0.4    0.049   0.898  0.874  0.79 
  0.4    0.0538  0.898  0.874  0.794
  0.4    0.0587  0.897  0.874  0.78 
  0.4    0.0636  0.897  0.875  0.778
  0.4    0.0685  0.9    0.881  0.766
  0.4    0.0733  0.898  0.879  0.767
  0.4    0.0782  0.899  0.882  0.76 
  0.4    0.0831  0.9    0.879  0.765
  0.4    0.0879  0.899  0.877  0.765
  0.4    0.0928  0.902  0.888  0.758
  0.4    0.0977  0.902  0.888  0.756
  0.4    0.103   0.901  0.882  0.765
  0.4    0.107   0.902  0.886  0.769
  0.4    0.112   0.902  0.886  0.764
  0.4    0.117   0.904  0.9    0.757
  0.4    0.122   0.904  0.9    0.749
  0.4    0.127   0.903  0.902  0.748
  0.4    0.132   0.903  0.904  0.743
  0.4    0.137   0.901  0.893  0.747
  0.4    0.142   0.903  0.9    0.747
  0.4    0.146   0.9    0.914  0.719
  0.4    0.151   0.902  0.926  0.708
  0.4    0.156   0.901  0.944  0.699
  0.4    0.161   0.896  0.902  0.716
  0.4    0.166   0.897  0.912  0.716
  0.4    0.171   0.901  0.935  0.705
  0.4    0.176   0.898  0.954  0.688
  0.4    0.181   0.894  0.951  0.686
  0.4    0.185   0.891  0.919  0.7  
  0.4    0.19    0.877  0.891  0.693
  0.4    0.195   0.877  0.926  0.676
  0.4    0.2     0.881  0.94   0.674
  0.6    0.01    0.889  0.842  0.803
  0.6    0.0149  0.892  0.846  0.8  
  0.6    0.0197  0.892  0.863  0.793
  0.6    0.0246  0.893  0.868  0.787
  0.6    0.0295  0.893  0.865  0.785
  0.6    0.0344  0.894  0.875  0.78 
  0.6    0.0392  0.893  0.874  0.777
  0.6    0.0441  0.894  0.872  0.779
  0.6    0.049   0.893  0.865  0.775
  0.6    0.0538  0.897  0.879  0.767
  0.6    0.0587  0.897  0.875  0.765
  0.6    0.0636  0.9    0.879  0.76 
  0.6    0.0685  0.899  0.879  0.764
  0.6    0.0733  0.901  0.889  0.756
  0.6    0.0782  0.902  0.889  0.756
  0.6    0.0831  0.901  0.895  0.747
  0.6    0.0879  0.903  0.907  0.737
  0.6    0.0928  0.9    0.896  0.744
  0.6    0.0977  0.899  0.898  0.739
  0.6    0.103   0.902  0.918  0.721
  0.6    0.107   0.901  0.944  0.7  
  0.6    0.112   0.902  0.93   0.708
  0.6    0.117   0.89   0.909  0.702
  0.6    0.122   0.894  0.939  0.693
  0.6    0.127   0.888  0.968  0.669
  0.6    0.132   0.876  0.9    0.688
  0.6    0.137   0.876  0.916  0.679
  0.6    0.142   0.869  0.981  0.652
  0.6    0.146   0.87   0.977  0.656
  0.6    0.151   0.868  0.991  0.643
  0.6    0.156   0.868  0.984  0.648
  0.6    0.161   0.869  0.988  0.644
  0.6    0.166   0.872  0.991  0.638
  0.6    0.171   0.869  0.991  0.638
  0.6    0.176   0.869  0.991  0.638
  0.6    0.181   0.872  0.991  0.638
  0.6    0.185   0.823  0.991  0.638
  0.6    0.19    0.823  0.991  0.638
  0.6    0.195   0.823  0.991  0.638
  0.6    0.2     0.823  0.991  0.638
  0.8    0.01    0.89   0.842  0.801
  0.8    0.0149  0.89   0.858  0.795
  0.8    0.0197  0.888  0.861  0.784
  0.8    0.0246  0.89   0.867  0.777
  0.8    0.0295  0.89   0.87   0.773
  0.8    0.0344  0.892  0.867  0.775
  0.8    0.0392  0.892  0.87   0.766
  0.8    0.0441  0.895  0.86   0.768
  0.8    0.049   0.896  0.868  0.767
  0.8    0.0538  0.898  0.884  0.76 
  0.8    0.0587  0.899  0.882  0.76 
  0.8    0.0636  0.898  0.872  0.759
  0.8    0.0685  0.901  0.904  0.74 
  0.8    0.0733  0.902  0.918  0.723
  0.8    0.0782  0.897  0.898  0.72 
  0.8    0.0831  0.901  0.937  0.706
  0.8    0.0879  0.897  0.953  0.683
  0.8    0.0928  0.892  0.914  0.702
  0.8    0.0977  0.877  0.904  0.688
  0.8    0.103   0.881  0.954  0.671
  0.8    0.107   0.868  0.981  0.652
  0.8    0.112   0.868  0.974  0.656
  0.8    0.117   0.868  0.986  0.646
  0.8    0.122   0.868  0.982  0.648
  0.8    0.127   0.872  0.991  0.638
  0.8    0.132   0.868  0.991  0.638
  0.8    0.137   0.823  0.991  0.638
  0.8    0.142   0.823  0.991  0.638
  0.8    0.146   0.823  0.991  0.638
  0.8    0.151   0.823  0.991  0.638
  0.8    0.156   0.823  0.991  0.638
  0.8    0.161   0.823  0.991  0.638
  0.8    0.166   0.815  0.991  0.638
  0.8    0.171   0.815  0.991  0.638
  0.8    0.176   0.815  0.991  0.638
  0.8    0.181   0.815  0.991  0.638
  0.8    0.185   0.815  0.991  0.638
  0.8    0.19    0.815  0.991  0.638
  0.8    0.195   0.815  0.991  0.638
  0.8    0.2     0.815  0.991  0.638
  1      0.01    0.889  0.854  0.799
  1      0.0149  0.888  0.858  0.791
  1      0.0197  0.887  0.858  0.774
  1      0.0246  0.887  0.854  0.775
  1      0.0295  0.888  0.851  0.775
  1      0.0344  0.891  0.861  0.772
  1      0.0392  0.897  0.881  0.761
  1      0.0441  0.897  0.882  0.761
  1      0.049   0.898  0.889  0.751
  1      0.0538  0.9    0.891  0.752
  1      0.0587  0.899  0.904  0.737
  1      0.0636  0.901  0.926  0.708
  1      0.0685  0.897  0.949  0.688
  1      0.0733  0.898  0.94   0.693
  1      0.0782  0.891  0.965  0.671
  1      0.0831  0.868  0.949  0.666
  1      0.0879  0.868  0.944  0.667
  1      0.0928  0.867  0.991  0.643
  1      0.0977  0.867  0.991  0.643
  1      0.103   0.872  0.991  0.638
  1      0.107   0.867  0.991  0.638
  1      0.112   0.823  0.991  0.638
  1      0.117   0.823  0.991  0.638
  1      0.122   0.823  0.991  0.638
  1      0.127   0.823  0.991  0.638
  1      0.132   0.815  0.991  0.638
  1      0.137   0.815  0.991  0.638
  1      0.142   0.815  0.991  0.638
  1      0.146   0.815  0.991  0.638
  1      0.151   0.815  0.991  0.638
  1      0.156   0.815  0.991  0.638
  1      0.161   0.815  0.991  0.638
  1      0.166   0.815  0.991  0.638
  1      0.171   0.815  0.991  0.638
  1      0.176   0.815  0.991  0.638
  1      0.181   0.815  0.991  0.638
  1      0.185   0.815  0.991  0.638
  1      0.19    0.815  0.991  0.638
  1      0.195   0.815  0      1    
  1      0.2     0.815  0      1    

ROC was used to select the optimal model using the largest value.
The final values used for the model were alpha = 0.1 and lambda = 0.176. 
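The glmnet grid crosses the mixing parameter `alpha` (0 = pure ridge, 1 = pure lasso) with the overall penalty strength `lambda`. A minimal scikit-learn analogue of elastic-net-penalized logistic regression is sketched below on invented data; `l1_ratio` plays the role of glmnet's `alpha`, and `C` is the inverse of the regularization strength, so a small `C` corresponds to a large `lambda`:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented data: only the first 3 of 10 predictors carry signal.
rng = np.random.default_rng(476)
X = rng.normal(size=(200, 10))
y = (X[:, :3].sum(axis=1) + rng.normal(scale=0.5, size=200) > 0).astype(int)

# The saga solver supports the elastic-net penalty; a mostly-lasso
# mix with heavy regularization typically zeroes out noise predictors.
model = LogisticRegression(penalty="elasticnet", solver="saga",
                           l1_ratio=0.9, C=0.1, max_iter=5000)
model.fit(X, y)
n_zero = int(np.sum(model.coef_ == 0))
print(n_zero)
```

The flat `0.815 / 0.991 / 0.638` rows at the lasso-heavy, high-lambda corner of the table above show the penalty driving the model toward the degenerate all-one-class predictor, which is why the grid's optimum sits at a small `alpha`.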
> 
> glmnet2008 <- merge(glmnFit$pred,  glmnFit$bestTune)
> glmnetCM <- confusionMatrix(glmnFit, norm = "none")
> glmnetCM
Repeated Train/Test Splits Estimated (1 reps, 0.75%) Confusion Matrix 

(entries are un-normalized counts)
 
Confusion Matrix and Statistics

              Reference
Prediction     successful unsuccessful
  successful          507          213
  unsuccessful         63          774
                                          
               Accuracy : 0.8227          
                 95% CI : (0.8028, 0.8414)
    No Information Rate : 0.6339          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.6382          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.8895          
            Specificity : 0.7842          
         Pos Pred Value : 0.7042          
         Neg Pred Value : 0.9247          
             Prevalence : 0.3661          
         Detection Rate : 0.3256          
   Detection Prevalence : 0.4624          
      Balanced Accuracy : 0.8368          
                                          
       'Positive' Class : successful      
                                          

> 
> glmnetRoc <- roc(response = glmnet2008$obs,
+                  predictor = glmnet2008$successful,
+                  levels = rev(levels(glmnet2008$obs)))
> 
> glmnFit0 <- glmnFit
> glmnFit0$results$lambda <- format(round(glmnFit0$results$lambda, 3))
> 
> glmnPlot <- plot(glmnFit0,
+                  plotType = "level",
+                  cuts = 15,
+                  scales = list(x = list(rot = 90, cex = .65)))
> 
> update(glmnPlot,
+        ylab = "Mixing Percentage\nRidge <---------> Lasso",
+        sub = "",
+        main = "Area Under the ROC Curve",
+        xlab = "Amount of Regularization")
> 
> plot(plsRoc2, type = "s", col = rgb(.2, .2, .2, .2), legacy.axes = TRUE)

Call:
roc.default(response = plsFit2$pred$obs, predictor = plsFit2$pred$successful,     levels = rev(levels(plsFit2$pred$obs)))

Data: plsFit2$pred$successful in 987 controls (plsFit2$pred$obs unsuccessful) < 570 cases (plsFit2$pred$obs successful).
Area under the curve: 0.895
> plot(ldaRoc, type = "s", add = TRUE, col = rgb(.2, .2, .2, .2), legacy.axes = TRUE)

Call:
roc.default(response = ldaFit$pred$obs, predictor = ldaFit$pred$successful,     levels = rev(levels(ldaFit$pred$obs)))

Data: ldaFit$pred$successful in 987 controls (ldaFit$pred$obs unsuccessful) < 570 cases (ldaFit$pred$obs successful).
Area under the curve: 0.8892
> plot(lrRoc, type = "s", col = rgb(.2, .2, .2, .2), add = TRUE, legacy.axes = TRUE)

Call:
roc.default(response = lrFit$pred$obs, predictor = lrFit$pred$successful,     levels = rev(levels(lrFit$pred$obs)))

Data: lrFit$pred$successful in 987 controls (lrFit$pred$obs unsuccessful) < 570 cases (lrFit$pred$obs successful).
Area under the curve: 0.8715
> plot(glmnetRoc, type = "s", add = TRUE, legacy.axes = TRUE)

Call:
roc.default(response = glmnet2008$obs, predictor = glmnet2008$successful,     levels = rev(levels(glmnet2008$obs)))

Data: glmnet2008$successful in 987 controls (glmnet2008$obs unsuccessful) < 570 cases (glmnet2008$obs successful).
Area under the curve: 0.91
> 
> ## Sparse logistic regression
> 
> set.seed(476)
> spLDAFit <- train(x = training[,fullSet], 
+                   y = training$Class,
+                   "sparseLDA",
+                   tuneGrid = expand.grid(lambda = c(.1),
+                                          NumVars = c(1:20, 50, 75, 100, 250, 500, 750, 1000)),
+                   preProc = c("center", "scale"),
+                   metric = "ROC",
+                   trControl = ctrl)
Loading required package: sparseLDA
Loading required package: lars
Loaded lars 1.2

Loading required package: elasticnet
Loading required package: mda
> spLDAFit
Sparse Linear Discriminant Analysis 

8190 samples
1071 predictors
   2 classes: 'successful', 'unsuccessful' 

Pre-processing: centered, scaled 
Resampling: Repeated Train/Test Splits Estimated (1 reps, 0.75%) 

Summary of sample sizes: 6633 

Resampling results across tuning parameters:

  NumVars  ROC    Sens   Spec 
  1        0.815  0.991  0.638
  2        0.823  0.991  0.638
  3        0.865  0.991  0.638
  4        0.868  0.96   0.663
  5        0.886  0.961  0.67 
  6        0.901  0.921  0.719
  7        0.899  0.891  0.751
  8        0.898  0.888  0.754
  9        0.897  0.886  0.751
  10       0.897  0.886  0.751
  11       0.897  0.886  0.751
  12       0.897  0.886  0.754
  13       0.897  0.886  0.755
  14       0.897  0.886  0.755
  15       0.897  0.886  0.755
  16       0.897  0.886  0.756
  17       0.897  0.884  0.764
  18       0.897  0.884  0.765
  19       0.897  0.882  0.766
  20       0.897  0.882  0.765
  50       0.899  0.877  0.78 
  75       0.9    0.877  0.785
  100      0.901  0.875  0.787
  250      0.9    0.856  0.797
  500      0.89   0.837  0.8  
  750      0.878  0.818  0.799
  1000     0.864  0.802  0.798

Tuning parameter 'lambda' was held constant at a value of 0.1
ROC was used to select the optimal model using  the largest value.
The final values used for the model were NumVars = 6 and lambda = 0.1. 
> 
> spLDA2008 <- merge(spLDAFit$pred,  spLDAFit$bestTune)
> spLDACM <- confusionMatrix(spLDAFit, norm = "none")
> spLDACM
Repeated Train/Test Splits Estimated (1 reps, 0.75%) Confusion Matrix 

(entries are un-normalized counts)
 
Confusion Matrix and Statistics

              Reference
Prediction     successful unsuccessful
  successful          525          277
  unsuccessful         45          710
                                          
               Accuracy : 0.7932          
                 95% CI : (0.7722, 0.8131)
    No Information Rate : 0.6339          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.5897          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.9211          
            Specificity : 0.7194          
         Pos Pred Value : 0.6546          
         Neg Pred Value : 0.9404          
             Prevalence : 0.3661          
         Detection Rate : 0.3372          
   Detection Prevalence : 0.5151          
      Balanced Accuracy : 0.8202          
                                          
       'Positive' Class : successful      
                                          

> 
> spLDARoc <- roc(response = spLDA2008$obs,
+                 predictor = spLDA2008$successful,
+                 levels = rev(levels(spLDA2008$obs)))
> 
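The `merge(spLDAFit$pred, spLDAFit$bestTune)` pattern above is a compact way to keep only the hold-out predictions made at the winning tuning values: `merge()` inner-joins on the shared parameter columns. A toy sketch with plain data frames (column names are illustrative, not caret's actual objects):

```r
## all hold-out predictions, one block per candidate lambda
pred     <- data.frame(obs    = c("a", "b", "a", "b"),
                       lambda = c(0.1, 0.1, 0.5, 0.5))
bestTune <- data.frame(lambda = 0.5)

## inner join on 'lambda' keeps only the best model's rows
merge(pred, bestTune)
```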
> update(plot(spLDAFit, scales = list(x = list(log = 10))),
+        ylab = "ROC AUC (2008 Hold-Out Data)")
> 
> plot(plsRoc2, type = "s", col = rgb(.2, .2, .2, .2), legacy.axes = TRUE)

Call:
roc.default(response = plsFit2$pred$obs, predictor = plsFit2$pred$successful,     levels = rev(levels(plsFit2$pred$obs)))

Data: plsFit2$pred$successful in 987 controls (plsFit2$pred$obs unsuccessful) < 570 cases (plsFit2$pred$obs successful).
Area under the curve: 0.895
> plot(glmnetRoc, type = "s", add = TRUE, col = rgb(.2, .2, .2, .2), legacy.axes = TRUE)

Call:
roc.default(response = glmnet2008$obs, predictor = glmnet2008$successful,     levels = rev(levels(glmnet2008$obs)))

Data: glmnet2008$successful in 987 controls (glmnet2008$obs unsuccessful) < 570 cases (glmnet2008$obs successful).
Area under the curve: 0.91
> plot(ldaRoc, type = "s", add = TRUE, col = rgb(.2, .2, .2, .2), legacy.axes = TRUE)

Call:
roc.default(response = ldaFit$pred$obs, predictor = ldaFit$pred$successful,     levels = rev(levels(ldaFit$pred$obs)))

Data: ldaFit$pred$successful in 987 controls (ldaFit$pred$obs unsuccessful) < 570 cases (ldaFit$pred$obs successful).
Area under the curve: 0.8892
> plot(lrRoc, type = "s", col = rgb(.2, .2, .2, .2), add = TRUE, legacy.axes = TRUE)

Call:
roc.default(response = lrFit$pred$obs, predictor = lrFit$pred$successful,     levels = rev(levels(lrFit$pred$obs)))

Data: lrFit$pred$successful in 987 controls (lrFit$pred$obs unsuccessful) < 570 cases (lrFit$pred$obs successful).
Area under the curve: 0.8715
> plot(spLDARoc, type = "s", add = TRUE, legacy.axes = TRUE)

Call:
roc.default(response = spLDA2008$obs, predictor = spLDA2008$successful,     levels = rev(levels(spLDA2008$obs)))

Data: spLDA2008$successful in 987 controls (spLDA2008$obs unsuccessful) < 570 cases (spLDA2008$obs successful).
Area under the curve: 0.9015
> 
> ################################################################################
> ### Section 12.6 Nearest Shrunken Centroids
> 
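The `tuneGrid = data.frame(threshold = seq(0, 25, length = 30))` used below yields the 30 evenly spaced shrinkage thresholds that appear in the results table (0, 0.862, 1.72, ...). A quick base-R check:

```r
th <- seq(0, 25, length.out = 30)   # 30 evenly spaced thresholds
length(th)                          # 30 candidate values
round(th[1:3], 3)                   # 0.000 0.862 1.724
```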
> set.seed(476)
> nscFit <- train(x = training[,fullSet], 
+                 y = training$Class,
+                 method = "pam",
+                 preProc = c("center", "scale"),
+                 tuneGrid = data.frame(threshold = seq(0, 25, length = 30)),
+                 metric = "ROC",
+                 trControl = ctrl)
Loading required package: pamr
Loading required package: cluster
Loading required package: survival
Loading required package: splines

Attaching package: 'survival'

The following object is masked from 'package:caret':

    cluster

> nscFit
Nearest Shrunken Centroids 

8190 samples
1071 predictors
   2 classes: 'successful', 'unsuccessful' 

Pre-processing: centered, scaled 
Resampling: Repeated Train/Test Splits Estimated (1 reps, 0.75%) 

Summary of sample sizes: 6633 

Resampling results across tuning parameters:

  threshold  ROC    Sens   Spec 
  0          0.827  0.784  0.733
  0.862      0.864  0.842  0.73 
  1.72       0.871  0.865  0.736
  2.59       0.873  0.861  0.744
  3.45       0.873  0.849  0.752
  4.31       0.868  0.823  0.754
  5.17       0.866  0.821  0.753
  6.03       0.862  0.856  0.732
  6.9        0.852  0.844  0.721
  7.76       0.857  0.935  0.675
  8.62       0.872  0.991  0.638
  9.48       0.832  0.991  0.638
  10.3       0.823  0.991  0.638
  11.2       0.815  0.991  0.638
  12.1       0.815  0.991  0.638
  12.9       0.815  0.991  0.638
  13.8       0.815  0      1    
  14.7       0.815  0      1    
  15.5       0.815  0      1    
  16.4       0.815  0      1    
  17.2       0.815  0      1    
  18.1       0.5    0      1    
  19         0.5    0      1    
  19.8       0.5    0      1    
  20.7       0.5    0      1    
  21.6       0.5    0      1    
  22.4       0.5    0      1    
  23.3       0.5    0      1    
  24.1       0.5    0      1    
  25         0.5    0      1    

ROC was used to select the optimal model using  the largest value.
The final value used for the model was threshold = 2.59. 
> 
> nsc2008 <- merge(nscFit$pred,  nscFit$bestTune)
> nscCM <- confusionMatrix(nscFit, norm = "none")
> nscCM
Repeated Train/Test Splits Estimated (1 reps, 0.75%) Confusion Matrix 

(entries are un-normalized counts)
 
Confusion Matrix and Statistics

              Reference
Prediction     successful unsuccessful
  successful          491          253
  unsuccessful         79          734
                                          
               Accuracy : 0.7868          
                 95% CI : (0.7656, 0.8069)
    No Information Rate : 0.6339          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.5684          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.8614          
            Specificity : 0.7437          
         Pos Pred Value : 0.6599          
         Neg Pred Value : 0.9028          
             Prevalence : 0.3661          
         Detection Rate : 0.3154          
   Detection Prevalence : 0.4778          
      Balanced Accuracy : 0.8025          
                                          
       'Positive' Class : successful      
                                          

> nscRoc <- roc(response = nsc2008$obs,
+               predictor = nsc2008$successful,
+               levels = rev(levels(nsc2008$obs)))
> update(plot(nscFit), ylab = "ROC AUC (2008 Hold-Out Data)")
> 
> 
> plot(plsRoc2, type = "s", col = rgb(.2, .2, .2, .2), legacy.axes = TRUE)

Call:
roc.default(response = plsFit2$pred$obs, predictor = plsFit2$pred$successful,     levels = rev(levels(plsFit2$pred$obs)))

Data: plsFit2$pred$successful in 987 controls (plsFit2$pred$obs unsuccessful) < 570 cases (plsFit2$pred$obs successful).
Area under the curve: 0.895
> plot(glmnetRoc, type = "s", add = TRUE, col = rgb(.2, .2, .2, .2), legacy.axes = TRUE)

Call:
roc.default(response = glmnet2008$obs, predictor = glmnet2008$successful,     levels = rev(levels(glmnet2008$obs)))

Data: glmnet2008$successful in 987 controls (glmnet2008$obs unsuccessful) < 570 cases (glmnet2008$obs successful).
Area under the curve: 0.91
> plot(ldaRoc, type = "s", add = TRUE, col = rgb(.2, .2, .2, .2), legacy.axes = TRUE)

Call:
roc.default(response = ldaFit$pred$obs, predictor = ldaFit$pred$successful,     levels = rev(levels(ldaFit$pred$obs)))

Data: ldaFit$pred$successful in 987 controls (ldaFit$pred$obs unsuccessful) < 570 cases (ldaFit$pred$obs successful).
Area under the curve: 0.8892
> plot(lrRoc, type = "s", col = rgb(.2, .2, .2, .2), add = TRUE, legacy.axes = TRUE)

Call:
roc.default(response = lrFit$pred$obs, predictor = lrFit$pred$successful,     levels = rev(levels(lrFit$pred$obs)))

Data: lrFit$pred$successful in 987 controls (lrFit$pred$obs unsuccessful) < 570 cases (lrFit$pred$obs successful).
Area under the curve: 0.8715
> plot(spLDARoc, type = "s", col = rgb(.2, .2, .2, .2), add = TRUE, legacy.axes = TRUE)

Call:
roc.default(response = spLDA2008$obs, predictor = spLDA2008$successful,     levels = rev(levels(spLDA2008$obs)))

Data: spLDA2008$successful in 987 controls (spLDA2008$obs unsuccessful) < 570 cases (spLDA2008$obs successful).
Area under the curve: 0.9015
> plot(nscRoc, type = "s", add = TRUE, legacy.axes = TRUE)

Call:
roc.default(response = nsc2008$obs, predictor = nsc2008$successful,     levels = rev(levels(nsc2008$obs)))

Data: nsc2008$successful in 987 controls (nsc2008$obs unsuccessful) < 570 cases (nsc2008$obs successful).
Area under the curve: 0.8733
> 
> sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] C

attached base packages:
[1] splines   parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] pamr_1.54       survival_2.37-4 cluster_1.14.4  sparseLDA_0.1-6
 [5] mda_0.4-2       elasticnet_1.1  lars_1.2        glmnet_1.9-3   
 [9] Matrix_1.0-12   klaR_0.6-8      pls_2.3-0       MASS_7.3-26    
[13] e1071_1.6-1     class_7.3-7     pROC_1.5.4      kernlab_0.9-18 
[17] reshape2_1.2.2  plyr_1.8        doMC_1.3.0      iterators_1.0.6
[21] foreach_1.4.0   caret_6.0-22    ggplot2_0.9.3.1 lattice_0.20-15

loaded via a namespace (and not attached):
 [1] RColorBrewer_1.0-5 car_2.0-17         codetools_0.2-8    colorspace_1.2-2  
 [5] compiler_3.0.1     dichromat_2.0-0    digest_0.6.3       grid_3.0.1        
 [9] gtable_0.1.2       labeling_0.1       munsell_0.4        proto_0.3-10      
[13] scales_0.2.3       stringr_0.6.2     
> 
> q("no")
> proc.time()
      user     system    elapsed 
376332.996   8337.928  35694.682 
In [71]:
%%R -w 600 -h 600

## runChapterScript(12)

##        user     system    elapsed 
##  376332.996   8337.928  35694.682
NULL
In [86]:
%%R

showChapterScript(13)
NULL
In [73]:
%%R

showChapterOutput(13)
R Information
R version 3.0.1 (2013-05-16) -- "Good Sport"
Copyright (C) 2013 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin10.8.0 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> ################################################################################
> ### R code from Applied Predictive Modeling (2013) by Kuhn and Johnson.
> ### Copyright 2013 Kuhn and Johnson
> ### Web Page: http://www.appliedpredictivemodeling.com
> ### Contact: Max Kuhn (mxkuhn@gmail.com) 
> ###
> ### Chapter 13 Non-Linear Classification Models
> ###
> ### Required packages: AppliedPredictiveModeling, caret, doMC (optional) 
> ###                    kernlab, klaR, lattice, latticeExtra, MASS, mda, nnet,
> ###                    pROC
> ###
> ### Data used: The grant application data. See the file 'CreateGrantData.R'
> ###
> ### Notes: 
> ### 1) This code is provided without warranty.
> ###
> ### 2) This code should help the user reproduce the results in the
> ### text. There will be differences between this code and what is in
> ### the computing section. For example, the computing sections show
> ### how the source functions work (e.g. randomForest() or plsr()),
> ### which were not directly used when creating the book. Also, there may be 
> ### syntax differences that occur over time as packages evolve. These files 
> ### will reflect those changes.
> ###
> ### 3) In some cases, the calculations in the book were run in 
> ### parallel. The sub-processes may reset the random number seed.
> ### Your results may vary slightly.
> ###
> ################################################################################
> 
> ################################################################################
> ### Section 13.1 Nonlinear Discriminant Analysis
> 
> 
> load("grantData.RData")
> 
> library(caret)
Loading required package: lattice
Loading required package: ggplot2
> 
> ### Optional: parallel processing can be used via the 'do' packages,
> ### such as doMC, doMPI etc. We used doMC (not on Windows) to speed
> ### up the computations.
>  
> ### WARNING: Be aware of how much memory is needed to parallel
> ### process. It can very quickly overwhelm the available hardware. We
> ### estimate the memory usage (VSIZE = total memory size) to be 
> ### 2700M/core.
> 
> library(doMC)
Loading required package: foreach
Loading required package: iterators
Loading required package: parallel
> registerDoMC(12)
> 
> ## This control object will be used across multiple models so that the
> ## data splitting is consistent
> 
> ctrl <- trainControl(method = "LGOCV",
+                      summaryFunction = twoClassSummary,
+                      classProbs = TRUE,
+                      index = list(TrainSet = pre2008),
+                      savePredictions = TRUE)
> 
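The `index = list(TrainSet = pre2008)` argument pins `train()` to one fixed split: the rows listed in `pre2008` fit each candidate model, and everything else (the 2008 grants) is the hold-out on which every ROC value below is computed. A base-R sketch of the idea (the 10-row data is made up):

```r
n       <- 10                            # pretend 10 grants
pre2008 <- 1:7                           # hypothetical pre-2008 training rows
holdOut <- setdiff(seq_len(n), pre2008)  # the fixed evaluation set
holdOut                                  # rows 8, 9, 10
```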
> set.seed(476)
> mdaFit <- train(x = training[,reducedSet], 
+                 y = training$Class,
+                 method = "mda",
+                 metric = "ROC",
+                 tries = 40,
+                 tuneGrid = expand.grid(subclasses = 1:8),
+                 trControl = ctrl)
Loading required package: mda
Loading required package: class
Loading required package: pROC
Loading required package: plyr
Type 'citation("pROC")' for a citation.

Attaching package: ‘pROC’

The following objects are masked from ‘package:stats’:

    cov, smooth, var

> mdaFit
Mixture Discriminant Analysis 

8190 samples
 252 predictors
   2 classes: 'successful', 'unsuccessful' 

No pre-processing
Resampling: Repeated Train/Test Splits Estimated (1 reps, 0.75%) 

Summary of sample sizes: 6633 

Resampling results across tuning parameters:

  subclasses  ROC    Sens   Spec 
  1           0.887  0.811  0.822
  2           0.865  0.789  0.813
  3           0.831  0.835  0.726
  4           0.852  0.732  0.82 
  5           0.842  0.733  0.797
  6           0.822  0.733  0.782
  7           0.836  0.823  0.734
  8           0.791  0.649  0.851

ROC was used to select the optimal model using  the largest value.
The final value used for the model was subclasses = 1. 
> 
> mdaFit$results <- mdaFit$results[!is.na(mdaFit$results$ROC),]                
> mdaFit$pred <- merge(mdaFit$pred,  mdaFit$bestTune)
> mdaCM <- confusionMatrix(mdaFit, norm = "none")
> mdaCM
Repeated Train/Test Splits Estimated (1 reps, 0.75%) Confusion Matrix 

(entries are un-normalized counts)
 
Confusion Matrix and Statistics

              Reference
Prediction     successful unsuccessful
  successful          462          176
  unsuccessful        108          811
                                          
               Accuracy : 0.8176          
                 95% CI : (0.7975, 0.8365)
    No Information Rate : 0.6339          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.6167          
 Mcnemar's Test P-Value : 7.017e-05       
                                          
            Sensitivity : 0.8105          
            Specificity : 0.8217          
         Pos Pred Value : 0.7241          
         Neg Pred Value : 0.8825          
             Prevalence : 0.3661          
         Detection Rate : 0.2967          
   Detection Prevalence : 0.4098          
      Balanced Accuracy : 0.8161          
                                          
       'Positive' Class : successful      
                                          

> 
> mdaRoc <- roc(response = mdaFit$pred$obs,
+               predictor = mdaFit$pred$successful,
+               levels = rev(levels(mdaFit$pred$obs)))
> mdaRoc

Call:
roc.default(response = mdaFit$pred$obs, predictor = mdaFit$pred$successful,     levels = rev(levels(mdaFit$pred$obs)))

Data: mdaFit$pred$successful in 987 controls (mdaFit$pred$obs unsuccessful) < 570 cases (mdaFit$pred$obs successful).
Area under the curve: 0.8874
> 
> update(plot(mdaFit,
+             ylab = "ROC AUC (2008 Hold-Out Data)"))
> 
> ################################################################################
> ### Section 13.2 Neural Networks
> 
> nnetGrid <- expand.grid(size = 1:10, decay = c(0, .1, 1, 2))
> maxSize <- max(nnetGrid$size)
> 
> 
> ## Four different models are evaluated based on the data pre-processing and
> ## whether a single or multiple models are used
> 
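The grid and weight-count bound above can be checked directly: `expand.grid` crosses the two vectors into 40 candidate networks, and the `MaxNWts` expression bounds the weight count of the largest one (252 is the predictor count of `reducedSet`, taken from the model summaries below):

```r
nnetGrid <- expand.grid(size = 1:10, decay = c(0, .1, 1, 2))
nrow(nnetGrid)                      # 40 candidate models

p       <- 252                      # predictors in reducedSet
maxSize <- max(nnetGrid$size)
## weights: input->hidden, size * (p + 1), plus hidden->output, size + 1
maxSize * (p + 1) + maxSize + 1     # 2541
```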
> set.seed(476)
> nnetFit <- train(x = training[,reducedSet], 
+                  y = training$Class,
+                  method = "nnet",
+                  metric = "ROC",
+                  preProc = c("center", "scale"),
+                  tuneGrid = nnetGrid,
+                  trace = FALSE,
+                  maxit = 2000,
+                  MaxNWts = 1*(maxSize * (length(reducedSet) + 1) + maxSize + 1),
+                  trControl = ctrl)
Loading required package: nnet
> nnetFit
Neural Network 

8190 samples
 252 predictors
   2 classes: 'successful', 'unsuccessful' 

Pre-processing: centered, scaled 
Resampling: Repeated Train/Test Splits Estimated (1 reps, 0.75%) 

Summary of sample sizes: 6633 

Resampling results across tuning parameters:

  size  decay  ROC    Sens   Spec 
  1     0      0.778  0.765  0.791
  1     0.1    0.845  0.793  0.794
  1     1      0.844  0.795  0.79 
  1     2      0.853  0.782  0.811
  2     0      0.804  0.811  0.753
  2     0.1    0.846  0.807  0.806
  2     1      0.86   0.73   0.841
  2     2      0.864  0.758  0.834
  3     0      0.841  0.805  0.757
  3     0.1    0.822  0.786  0.728
  3     1      0.857  0.73   0.833
  3     2      0.859  0.747  0.81 
  4     0      0.828  0.795  0.754
  4     0.1    0.854  0.74   0.814
  4     1      0.869  0.802  0.796
  4     2      0.864  0.779  0.785
  5     0      0.819  0.767  0.719
  5     0.1    0.843  0.786  0.787
  5     1      0.845  0.716  0.817
  5     2      0.851  0.728  0.829
  6     0      0.844  0.728  0.806
  6     0.1    0.8    0.693  0.775
  6     1      0.848  0.782  0.778
  6     2      0.869  0.777  0.82 
  7     0      0.833  0.807  0.757
  7     0.1    0.806  0.728  0.768
  7     1      0.831  0.746  0.777
  7     2      0.863  0.758  0.822
  8     0      0.833  0.761  0.784
  8     0.1    0.847  0.751  0.78 
  8     1      0.857  0.753  0.803
  8     2      0.866  0.77   0.814
  9     0      0.848  0.784  0.789
  9     0.1    0.836  0.719  0.798
  9     1      0.843  0.753  0.781
  9     2      0.854  0.746  0.803
  10    0      0.806  0.707  0.779
  10    0.1    0.82   0.726  0.76 
  10    1      0.846  0.73   0.807
  10    2      0.863  0.749  0.817

ROC was used to select the optimal model using  the largest value.
The final values used for the model were size = 4 and decay = 1. 
> 
> set.seed(476)
> nnetFit2 <- train(x = training[,reducedSet], 
+                   y = training$Class,
+                   method = "nnet",
+                   metric = "ROC",
+                   preProc = c("center", "scale", "spatialSign"),
+                   tuneGrid = nnetGrid,
+                   trace = FALSE,
+                   maxit = 2000,
+                   MaxNWts = 1*(maxSize * (length(reducedSet) + 1) + maxSize + 1),
+                   trControl = ctrl)
> nnetFit2
Neural Network 

8190 samples
 252 predictors
   2 classes: 'successful', 'unsuccessful' 

Pre-processing: centered, scaled, spatial sign transformation 
Resampling: Repeated Train/Test Splits Estimated (1 reps, 0.75%) 

Summary of sample sizes: 6633 

Resampling results across tuning parameters:

  size  decay  ROC    Sens   Spec 
  1     0      0.782  0.782  0.78 
  1     0.1    0.863  0.784  0.809
  1     1      0.874  0.807  0.805
  1     2      0.88   0.804  0.807
  2     0      0.776  0.804  0.711
  2     0.1    0.892  0.767  0.861
  2     1      0.897  0.804  0.839
  2     2      0.881  0.805  0.811
  3     0      0.841  0.653  0.876
  3     0.1    0.887  0.737  0.851
  3     1      0.898  0.805  0.851
  3     2      0.884  0.805  0.812
  4     0      0.786  0.756  0.715
  4     0.1    0.871  0.716  0.829
  4     1      0.899  0.793  0.84 
  4     2      0.883  0.804  0.812
  5     0      0.862  0.867  0.705
  5     0.1    0.858  0.718  0.836
  5     1      0.902  0.788  0.857
  5     2      0.883  0.804  0.812
  6     0      0.808  0.691  0.796
  6     0.1    0.859  0.712  0.844
  6     1      0.896  0.795  0.842
  6     2      0.883  0.804  0.812
  7     0      0.807  0.732  0.782
  7     0.1    0.843  0.693  0.829
  7     1      0.902  0.789  0.857
  7     2      0.883  0.804  0.813
  8     0      0.73   0.661  0.795
  8     0.1    0.858  0.681  0.834
  8     1      0.903  0.791  0.853
  8     2      0.883  0.804  0.813
  9     0      0.857  0.779  0.804
  9     0.1    0.87   0.739  0.833
  9     1      0.902  0.788  0.857
  9     2      0.883  0.804  0.813
  10    0      0.788  0.684  0.823
  10    0.1    0.876  0.721  0.845
  10    1      0.897  0.796  0.842
  10    2      0.883  0.804  0.813

ROC was used to select the optimal model using  the largest value.
The final values used for the model were size = 8 and decay = 1. 
> 
> nnetGrid$bag <- FALSE
> 
> set.seed(476)
> nnetFit3 <- train(x = training[,reducedSet], 
+                   y = training$Class,
+                   method = "avNNet",
+                   metric = "ROC",
+                   preProc = c("center", "scale"),
+                   tuneGrid = nnetGrid,
+                   repeats = 10,
+                   trace = FALSE,
+                   maxit = 2000,
+                   MaxNWts = 10*(maxSize * (length(reducedSet) + 1) + maxSize + 1),
+                   allowParallel = FALSE, ## this will cause too many workers to be launched.
+                   trControl = ctrl)
> nnetFit3
Model Averaged Neural Network 

8190 samples
 252 predictors
   2 classes: 'successful', 'unsuccessful' 

Pre-processing: centered, scaled 
Resampling: Repeated Train/Test Splits Estimated (1 reps, 0.75%) 

Summary of sample sizes: 6633 

Resampling results across tuning parameters:

  size  decay  ROC    Sens   Spec 
  1     0      0.884  0.867  0.762
  1     0.1    0.868  0.779  0.812
  1     1      0.847  0.774  0.812
  1     2      0.849  0.777  0.81 
  2     0      0.892  0.825  0.791
  2     0.1    0.886  0.784  0.854
  2     1      0.895  0.788  0.844
  2     2      0.895  0.796  0.845
  3     0      0.887  0.826  0.793
  3     0.1    0.882  0.8    0.825
  3     1      0.89   0.795  0.842
  3     2      0.899  0.798  0.838
  4     0      0.883  0.821  0.805
  4     0.1    0.887  0.8    0.821
  4     1      0.899  0.781  0.853
  4     2      0.902  0.798  0.86 
  5     0      0.886  0.83   0.79 
  5     0.1    0.874  0.788  0.824
  5     1      0.901  0.8    0.844
  5     2      0.9    0.8    0.851
  6     0      0.885  0.819  0.807
  6     0.1    0.882  0.789  0.827
  6     1      0.893  0.786  0.854
  6     2      0.9    0.8    0.849
  7     0      0.881  0.832  0.761
  7     0.1    0.883  0.791  0.821
  7     1      0.898  0.811  0.834
  7     2      0.899  0.807  0.859
  8     0      0.889  0.818  0.793
  8     0.1    0.88   0.786  0.83 
  8     1      0.891  0.8    0.823
  8     2      0.901  0.791  0.845
  9     0      0.887  0.8    0.806
  9     0.1    0.889  0.786  0.817
  9     1      0.894  0.791  0.848
  9     2      0.9    0.802  0.836
  10    0      0.883  0.811  0.805
  10    0.1    0.881  0.784  0.825
  10    1      0.898  0.793  0.844
  10    2      0.896  0.802  0.839

Tuning parameter 'bag' was held constant at a value of FALSE
ROC was used to select the optimal model using  the largest value.
The final values used for the model were size = 4, decay = 2 and bag = FALSE. 
> 
> set.seed(476)
> nnetFit4 <- train(x = training[,reducedSet], 
+                   y = training$Class,
+                   method = "avNNet",
+                   metric = "ROC",
+                   preProc = c("center", "scale", "spatialSign"),
+                   tuneGrid = nnetGrid,
+                   trace = FALSE,
+                   maxit = 2000,
+                   repeats = 10,
+                   MaxNWts = 10*(maxSize * (length(reducedSet) + 1) + maxSize + 1),
+                   allowParallel = FALSE, 
+                   trControl = ctrl)
> nnetFit4
Model Averaged Neural Network 

8190 samples
 252 predictors
   2 classes: 'successful', 'unsuccessful' 

Pre-processing: centered, scaled, spatial sign transformation 
Resampling: Repeated Train/Test Splits Estimated (1 reps, 0.75%) 

Summary of sample sizes: 6633 

Resampling results across tuning parameters:

  size  decay  ROC    Sens   Spec 
  1     0      0.867  0.784  0.8  
  1     0.1    0.857  0.782  0.81 
  1     1      0.874  0.804  0.807
  1     2      0.882  0.795  0.802
  2     0      0.882  0.782  0.845
  2     0.1    0.897  0.754  0.872
  2     1      0.875  0.798  0.811
  2     2      0.881  0.795  0.804
  3     0      0.89   0.796  0.833
  3     0.1    0.907  0.788  0.864
  3     1      0.876  0.795  0.81 
  3     2      0.881  0.795  0.804
  4     0      0.889  0.795  0.838
  4     0.1    0.911  0.782  0.867
  4     1      0.874  0.798  0.809
  4     2      0.881  0.795  0.805
  5     0      0.893  0.786  0.861
  5     0.1    0.909  0.779  0.87 
  5     1      0.875  0.796  0.809
  5     2      0.881  0.795  0.805
  6     0      0.893  0.786  0.848
  6     0.1    0.904  0.754  0.865
  6     1      0.876  0.793  0.81 
  6     2      0.881  0.795  0.805
  7     0      0.89   0.782  0.849
  7     0.1    0.905  0.76   0.866
  7     1      0.881  0.796  0.817
  7     2      0.881  0.795  0.805
  8     0      0.898  0.795  0.856
  8     0.1    0.904  0.756  0.865
  8     1      0.878  0.795  0.813
  8     2      0.881  0.795  0.805
  9     0      0.893  0.782  0.857
  9     0.1    0.902  0.761  0.869
  9     1      0.878  0.795  0.813
  9     2      0.881  0.795  0.805
  10    0      0.895  0.786  0.865
  10    0.1    0.901  0.76   0.858
  10    1      0.878  0.795  0.814
  10    2      0.881  0.795  0.806

Tuning parameter 'bag' was held constant at a value of FALSE
ROC was used to select the optimal model using  the largest value.
The final values used for the model were size = 4, decay = 0.1 and bag = FALSE. 
> 
> nnetFit4$pred <- merge(nnetFit4$pred,  nnetFit4$bestTune)
> nnetCM <- confusionMatrix(nnetFit4, norm = "none")
> nnetCM
Repeated Train/Test Splits Estimated (1 reps, 0.75%) Confusion Matrix 

(entries are un-normalized counts)
 
Confusion Matrix and Statistics

              Reference
Prediction     successful unsuccessful
  successful          446          131
  unsuccessful        124          856
                                          
               Accuracy : 0.8362          
                 95% CI : (0.8169, 0.8543)
    No Information Rate : 0.6339          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.648           
 Mcnemar's Test P-Value : 0.7071          
                                          
            Sensitivity : 0.7825          
            Specificity : 0.8673          
         Pos Pred Value : 0.7730          
         Neg Pred Value : 0.8735          
             Prevalence : 0.3661          
         Detection Rate : 0.2864          
   Detection Prevalence : 0.3706          
      Balanced Accuracy : 0.8249          
                                          
       'Positive' Class : successful      
                                          

> 
> nnetRoc <- roc(response = nnetFit4$pred$obs,
+                predictor = nnetFit4$pred$successful,
+                levels = rev(levels(nnetFit4$pred$obs)))
> 
> 
> nnet1 <- nnetFit$results
> nnet1$Transform <- "No Transformation"
> nnet1$Model <- "Single Model"
> 
> nnet2 <- nnetFit2$results
> nnet2$Transform <- "Spatial Sign"
> nnet2$Model <- "Single Model"
> 
> nnet3 <- nnetFit3$results
> nnet3$Transform <- "No Transformation"
> nnet3$Model <- "Model Averaging"
> nnet3$bag <- NULL
> 
> nnet4 <- nnetFit4$results
> nnet4$Transform <- "Spatial Sign"
> nnet4$Model <- "Model Averaging"
> nnet4$bag <- NULL
> 
> nnetResults <- rbind(nnet1, nnet2, nnet3, nnet4)
> nnetResults$Model <- factor(as.character(nnetResults$Model),
+                             levels = c("Single Model", "Model Averaging"))
> library(latticeExtra)
Loading required package: RColorBrewer

Attaching package: ‘latticeExtra’

The following object is masked from ‘package:ggplot2’:

    layer

> useOuterStrips(
+   xyplot(ROC ~ size|Model*Transform,
+          data = nnetResults,
+          groups = decay,
+          as.table = TRUE,
+          type = c("p", "l", "g"),
+          lty = 1,
+          ylab = "ROC AUC (2008 Hold-Out Data)",
+          xlab = "Number of Hidden Units",
+          auto.key = list(columns = 4, 
+                          title = "Weight Decay", 
+                          cex.title = 1)))
> 
> plot(nnetRoc, type = "s", legacy.axes = TRUE)

Call:
roc.default(response = nnetFit4$pred$obs, predictor = nnetFit4$pred$successful,     levels = rev(levels(nnetFit4$pred$obs)))

Data: nnetFit4$pred$successful in 987 controls (nnetFit4$pred$obs unsuccessful) < 570 cases (nnetFit4$pred$obs successful).
Area under the curve: 0.9111
> 
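Every model in this chapter is scored the same way once train() finishes: the hold-out predictions are filtered down to the best tuning parameters, a confusion matrix is tabulated on raw counts, and pROC computes the ROC curve. A minimal sketch of that recurring pattern (`fit` is a hypothetical caret train object created with savePredictions = TRUE and a positive class named "successful"):

In [ ]:
%%R

## Hold-out evaluation pattern used for every model below (sketch only;
## `fit` is a made-up train() object, not part of the chapter script).
fit$pred <- merge(fit$pred, fit$bestTune)       # keep rows for the best tune only
fitCM  <- confusionMatrix(fit, norm = "none")   # resampled confusion matrix, raw counts
fitRoc <- roc(response = fit$pred$obs,
              predictor = fit$pred$successful,
              levels = rev(levels(fit$pred$obs)))  # reversed so 'successful' is the case level
plot(fitRoc, type = "s", legacy.axes = TRUE)    # stair-step ROC, x axis as 1 - specificity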
> ################################################################################
> ### Section 13.3 Flexible Discriminant Analysis
> 
> set.seed(476)
> fdaFit <- train(x = training[,reducedSet], 
+                 y = training$Class,
+                 method = "fda",
+                 metric = "ROC",
+                 tuneGrid = expand.grid(degree = 1, nprune = 2:25),
+                 trControl = ctrl)
Loading required package: earth
Loading required package: leaps
Loading required package: plotmo
Loading required package: plotrix
> fdaFit
Flexible Discriminant Analysis 

8190 samples
 252 predictors
   2 classes: 'successful', 'unsuccessful' 

No pre-processing
Resampling: Repeated Train/Test Splits Estimated (1 reps, 0.75%) 

Summary of sample sizes: 6633 

Resampling results across tuning parameters:

  nprune  ROC    Sens   Spec 
  2       0.815  0.991  0.638
  3       0.809  0.995  0.567
  4       0.86   0.947  0.73 
  5       0.869  0.963  0.728
  6       0.877  0.968  0.727
  7       0.893  0.823  0.806
  8       0.903  0.779  0.851
  9       0.909  0.83   0.841
  10      0.915  0.816  0.853
  11      0.919  0.825  0.859
  12      0.92   0.816  0.865
  13      0.918  0.809  0.865
  14      0.92   0.807  0.865
  15      0.92   0.819  0.861
  16      0.921  0.826  0.858
  17      0.921  0.818  0.863
  18      0.922  0.821  0.86 
  19      0.924  0.825  0.864
  20      0.922  0.825  0.858
  21      0.919  0.816  0.869
  22      0.919  0.811  0.872
  23      0.918  0.811  0.867
  24      0.918  0.809  0.868
  25      0.918  0.809  0.865

Tuning parameter 'degree' was held constant at a value of 1
ROC was used to select the optimal model using  the largest value.
The final values used for the model were degree = 1 and nprune = 19. 
> 
> fdaFit$pred <- merge(fdaFit$pred,  fdaFit$bestTune)
> fdaCM <- confusionMatrix(fdaFit, norm = "none")
> fdaCM
Repeated Train/Test Splits Estimated (1 reps, 0.75%) Confusion Matrix 

(entries are un-normalized counts)
 
Confusion Matrix and Statistics

              Reference
Prediction     successful unsuccessful
  successful          470          134
  unsuccessful        100          853
                                         
               Accuracy : 0.8497         
                 95% CI : (0.831, 0.8671)
    No Information Rate : 0.6339         
    P-Value [Acc > NIR] : < 2e-16        
                                         
                  Kappa : 0.6802         
 Mcnemar's Test P-Value : 0.03098        
                                         
            Sensitivity : 0.8246         
            Specificity : 0.8642         
         Pos Pred Value : 0.7781         
         Neg Pred Value : 0.8951         
             Prevalence : 0.3661         
         Detection Rate : 0.3019         
   Detection Prevalence : 0.3879         
      Balanced Accuracy : 0.8444         
                                         
       'Positive' Class : successful     
                                         

> 
> fdaRoc <- roc(response = fdaFit$pred$obs,
+               predictor = fdaFit$pred$successful,
+               levels = rev(levels(fdaFit$pred$obs)))
> 
> update(plot(fdaFit), ylab = "ROC AUC (2008 Hold-Out Data)")
> 
> plot(nnetRoc, type = "s", col = rgb(.2, .2, .2, .2), legacy.axes = TRUE)

Call:
roc.default(response = nnetFit4$pred$obs, predictor = nnetFit4$pred$successful,     levels = rev(levels(nnetFit4$pred$obs)))

Data: nnetFit4$pred$successful in 987 controls (nnetFit4$pred$obs unsuccessful) < 570 cases (nnetFit4$pred$obs successful).
Area under the curve: 0.9111
> plot(fdaRoc, type = "s", add = TRUE, legacy.axes = TRUE)

Call:
roc.default(response = fdaFit$pred$obs, predictor = fdaFit$pred$successful,     levels = rev(levels(fdaFit$pred$obs)))

Data: fdaFit$pred$successful in 987 controls (fdaFit$pred$obs unsuccessful) < 570 cases (fdaFit$pred$obs successful).
Area under the curve: 0.924
> 
> 
> ################################################################################
> ### Section 13.4 Support Vector Machines
> 
> library(kernlab)
> 
> set.seed(201)
> sigmaRangeFull <- sigest(as.matrix(training[,fullSet]))
> svmRGridFull <- expand.grid(sigma =  as.vector(sigmaRangeFull)[1],
+                             C = 2^(-3:4))
> set.seed(476)
> svmRFitFull <- train(x = training[,fullSet], 
+                      y = training$Class,
+                      method = "svmRadial",
+                      metric = "ROC",
+                      preProc = c("center", "scale"),
+                      tuneGrid = svmRGridFull,
+                      trControl = ctrl)
> svmRFitFull
Support Vector Machines with Radial Basis Function Kernel 

8190 samples
1070 predictors
   2 classes: 'successful', 'unsuccessful' 

Pre-processing: centered, scaled 
Resampling: Repeated Train/Test Splits Estimated (1 reps, 0.75%) 

Summary of sample sizes: 6633 

Resampling results across tuning parameters:

  C      ROC    Sens   Spec 
  0.125  0.781  0.916  0.521
  0.25   0.851  0.861  0.694
  0.5    0.866  0.84   0.755
  1      0.873  0.83   0.774
  2      0.875  0.821  0.791
  4      0.875  0.811  0.803
  8      0.87   0.798  0.799
  16     0.866  0.798  0.81 

Tuning parameter 'sigma' was held constant at a value of 0.0002385724
ROC was used to select the optimal model using  the largest value.
The final values used for the model were sigma = 0.000239 and C = 2. 
> 
> set.seed(202)
> sigmaRangeReduced <- sigest(as.matrix(training[,reducedSet]))
> svmRGridReduced <- expand.grid(sigma = sigmaRangeReduced[1],
+                                C = 2^(seq(-4, 4)))
> set.seed(476)
> svmRFitReduced <- train(x = training[,reducedSet], 
+                         y = training$Class,
+                         method = "svmRadial",
+                         metric = "ROC",
+                         preProc = c("center", "scale"),
+                         tuneGrid = svmRGridReduced,
+                         trControl = ctrl)
> svmRFitReduced
Support Vector Machines with Radial Basis Function Kernel 

8190 samples
 252 predictors
   2 classes: 'successful', 'unsuccessful' 

Pre-processing: centered, scaled 
Resampling: Repeated Train/Test Splits Estimated (1 reps, 0.75%) 

Summary of sample sizes: 6633 

Resampling results across tuning parameters:

  C       ROC    Sens   Spec 
  0.0625  0.866  0.916  0.691
  0.125   0.88   0.86   0.758
  0.25    0.89   0.849  0.781
  0.5     0.894  0.83   0.8  
  1       0.895  0.811  0.815
  2       0.891  0.805  0.83 
  4       0.887  0.805  0.822
  8       0.885  0.798  0.821
  16      0.882  0.8    0.82 

Tuning parameter 'sigma' was held constant at a value of 0.001166986
ROC was used to select the optimal model using  the largest value.
The final values used for the model were sigma = 0.00117 and C = 1. 
> 
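kernlab's sigest() sidesteps tuning the RBF kernel's sigma: it returns the 0.1, 0.5, and 0.9 quantiles of a data-based estimate of sigma, and the code above fixes sigma at the first (smallest) of those values so that only the cost C needs tuning. A toy illustration on simulated data (the matrix `x` here is made up for the example, not the grant data):

In [ ]:
%%R

## sigest() estimates a plausible range for the RBF sigma from the data (sketch).
library(kernlab)
set.seed(201)
x <- matrix(rnorm(200 * 5), ncol = 5)             # hypothetical predictor matrix
sigest(x)                                         # three candidates: 0.1, 0.5, 0.9 quantiles
expand.grid(sigma = sigest(x)[1], C = 2^(-3:4))   # tune C only, with sigma held fixed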
> svmPGrid <-  expand.grid(degree = 1:2,
+                          scale = c(0.01, .005),
+                          C = 2^(seq(-6, -2, length = 10)))
> 
> set.seed(476)
> svmPFitFull <- train(x = training[,fullSet], 
+                      y = training$Class,
+                      method = "svmPoly",
+                      metric = "ROC",
+                      preProc = c("center", "scale"),
+                      tuneGrid = svmPGrid,
+                      trControl = ctrl)
> svmPFitFull
Support Vector Machines with Polynomial Kernel 

8190 samples
1070 predictors
   2 classes: 'successful', 'unsuccessful' 

Pre-processing: centered, scaled 
Resampling: Repeated Train/Test Splits Estimated (1 reps, 0.75%) 

Summary of sample sizes: 6633 

Resampling results across tuning parameters:

  degree  scale  C       ROC    Sens   Spec 
  1       0.005  0.0156  0.856  0.886  0.706
  1       0.005  0.0213  0.861  0.87   0.733
  1       0.005  0.0289  0.863  0.868  0.758
  1       0.005  0.0394  0.867  0.863  0.768
  1       0.005  0.0536  0.87   0.863  0.777
  1       0.005  0.0729  0.872  0.856  0.782
  1       0.005  0.0992  0.872  0.84   0.789
  1       0.005  0.135   0.873  0.825  0.798
  1       0.005  0.184   0.872  0.816  0.798
  1       0.005  0.25    0.872  0.814  0.803
  1       0.01   0.0156  0.864  0.868  0.758
  1       0.01   0.0213  0.868  0.865  0.768
  1       0.01   0.0289  0.87   0.861  0.78 
  1       0.01   0.0394  0.872  0.849  0.784
  1       0.01   0.0536  0.873  0.84   0.79 
  1       0.01   0.0729  0.873  0.825  0.798
  1       0.01   0.0992  0.872  0.814  0.801
  1       0.01   0.135   0.871  0.812  0.802
  1       0.01   0.184   0.868  0.812  0.795
  1       0.01   0.25    0.862  0.791  0.795
  2       0.005  0.0156  0.838  0.812  0.752
  2       0.005  0.0213  0.845  0.816  0.766
  2       0.005  0.0289  0.852  0.819  0.776
  2       0.005  0.0394  0.856  0.819  0.78 
  2       0.005  0.0536  0.86   0.814  0.784
  2       0.005  0.0729  0.865  0.825  0.782
  2       0.005  0.0992  0.866  0.823  0.787
  2       0.005  0.135   0.865  0.816  0.788
  2       0.005  0.184   0.862  0.802  0.789
  2       0.005  0.25    0.86   0.807  0.784
  2       0.01   0.0156  0.845  0.816  0.765
  2       0.01   0.0213  0.851  0.811  0.774
  2       0.01   0.0289  0.856  0.811  0.778
  2       0.01   0.0394  0.857  0.812  0.78 
  2       0.01   0.0536  0.855  0.809  0.779
  2       0.01   0.0729  0.854  0.796  0.786
  2       0.01   0.0992  0.854  0.789  0.783
  2       0.01   0.135   0.852  0.788  0.78 
  2       0.01   0.184   0.851  0.782  0.778
  2       0.01   0.25    0.85   0.784  0.78 

ROC was used to select the optimal model using  the largest value.
The final values used for the model were degree = 1, scale = 0.01 and C = 0.0729. 
> 
> svmPGrid2 <-  expand.grid(degree = 1:2,
+                           scale = c(0.01, .005),
+                           C = 2^(seq(-6, -2, length = 10)))
> set.seed(476)
> svmPFitReduced <- train(x = training[,reducedSet], 
+                         y = training$Class,
+                         method = "svmPoly",
+                         metric = "ROC",
+                         preProc = c("center", "scale"),
+                         tuneGrid = svmPGrid2,
+                         fit = FALSE,
+                         trControl = ctrl)
line search fails -2.047663 -0.1283902 1.181205e-05 2.17076e-06 -2.621876e-08 -4.051435e-09 -3.184921e-13
Warning messages:
1: In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,  :
  There were missing values in resampled performance measures.
2: In train.default(x = training[, reducedSet], y = training$Class,  :
  missing values found in aggregated results
> svmPFitReduced
Support Vector Machines with Polynomial Kernel 

8190 samples
 252 predictors
   2 classes: 'successful', 'unsuccessful' 

Pre-processing: centered, scaled 
Resampling: Repeated Train/Test Splits Estimated (1 reps, 0.75%) 

Summary of sample sizes: 6633 

Resampling results across tuning parameters:

  degree  scale  C       ROC    Sens   Spec 
  1       0.005  0.0156  0.867  0.94   0.653
  1       0.005  0.0213  0.875  0.926  0.707
  1       0.005  0.0289  0.881  0.912  0.738
  1       0.005  0.0394  0.887  0.909  0.743
  1       0.005  0.0536  0.892  0.904  0.762
  1       0.005  0.0729  0.895  0.895  0.772
  1       0.005  0.0992  0.897  0.863  0.781
  1       0.005  0.135   0.896  0.854  0.797
  1       0.005  0.184   0.896  0.849  0.804
  1       0.005  0.25    0.895  0.844  0.811
  1       0.01   0.0156  0.883  0.916  0.74 
  1       0.01   0.0213  0.888  0.911  0.749
  1       0.01   0.0289  0.893  0.898  0.762
  1       0.01   0.0394  0.896  0.886  0.775
  1       0.01   0.0536  0.896  0.863  0.785
  1       0.01   0.0729  0.897  0.853  0.8  
  1       0.01   0.0992  0.896  0.847  0.81 
  1       0.01   0.135   0.894  0.844  0.81 
  1       0.01   0.184   0.891  0.837  0.818
  1       0.01   0.25    0.888  0.816  0.825
  2       0.005  0.0156  0.88   0.902  0.746
  2       0.005  0.0213  0.886  0.896  0.759
  2       0.005  0.0289  0.89   0.879  0.774
  2       0.005  0.0394  0.894  0.877  0.777
  2       0.005  0.0536  0.896  0.854  0.794
  2       0.005  0.0729  0.898  0.842  0.805
  2       0.005  0.0992  0.898  0.83   0.815
  2       0.005  0.135   0.896  0.828  0.828
  2       0.005  0.184   0.896  0.819  0.828
  2       0.005  0.25    0.893  0.818  0.828
  2       0.01   0.0156  0.891  0.863  0.781
  2       0.01   0.0213  0.894  0.856  0.788
  2       0.01   0.0289  0.896  0.832  0.803
  2       0.01   0.0394  0.897  0.826  0.81 
  2       0.01   0.0536  0.896  0.833  0.818
  2       0.01   0.0729  0.893  0.819  0.821
  2       0.01   0.0992  NaN    NaN    NaN  
  2       0.01   0.135   0.887  0.802  0.83 
  2       0.01   0.184   0.883  0.804  0.832
  2       0.01   0.25    0.88   0.8    0.831

ROC was used to select the optimal model using  the largest value.
The final values used for the model were degree = 2, scale = 0.005 and C = 0.0729. 
> 
> svmPFitReduced$pred <- merge(svmPFitReduced$pred,  svmPFitReduced$bestTune)
> svmPCM <- confusionMatrix(svmPFitReduced, norm = "none")
> svmPRoc <- roc(response = svmPFitReduced$pred$obs,
+                predictor = svmPFitReduced$pred$successful,
+                levels = rev(levels(svmPFitReduced$pred$obs)))
> 
> 
> svmRadialResults <- rbind(svmRFitReduced$results,
+                           svmRFitFull$results)
> svmRadialResults$Set <- c(rep("Reduced Set", nrow(svmRFitReduced$results)),
+                           rep("Full Set", nrow(svmRFitFull$results)))
> svmRadialResults$Sigma <- paste("sigma = ", 
+                                 format(svmRadialResults$sigma, 
+                                        scientific = FALSE, digits= 5))
> svmRadialResults <- svmRadialResults[!is.na(svmRadialResults$ROC),]
> xyplot(ROC ~ C|Set, data = svmRadialResults,
+        groups = Sigma, type = c("g", "o"),
+        xlab = "Cost",
+        ylab = "ROC (2008 Hold-Out Data)",
+        auto.key = list(columns = 2),
+        scales = list(x = list(log = 2)))
> 
> svmPolyResults <- rbind(svmPFitReduced$results,
+                         svmPFitFull$results)
> svmPolyResults$Set <- c(rep("Reduced Set", nrow(svmPFitReduced$results)),
+                         rep("Full Set", nrow(svmPFitFull$results)))
> svmPolyResults <- svmPolyResults[!is.na(svmPolyResults$ROC),]
> svmPolyResults$scale <- paste("scale = ", 
+                               format(svmPolyResults$scale, 
+                                      scientific = FALSE))
> svmPolyResults$Degree <- "Linear"
> svmPolyResults$Degree[svmPolyResults$degree == 2] <- "Quadratic"
> useOuterStrips(xyplot(ROC ~ C|Degree*Set, data = svmPolyResults,
+                       groups = scale, type = c("g", "o"),
+                       xlab = "Cost",
+                       ylab = "ROC (2008 Hold-Out Data)",
+                       auto.key = list(columns = 2),
+                       scales = list(x = list(log = 2))))
> 
> plot(nnetRoc, type = "s", col = rgb(.2, .2, .2, .2), legacy.axes = TRUE)

Call:
roc.default(response = nnetFit4$pred$obs, predictor = nnetFit4$pred$successful,     levels = rev(levels(nnetFit4$pred$obs)))

Data: nnetFit4$pred$successful in 987 controls (nnetFit4$pred$obs unsuccessful) < 570 cases (nnetFit4$pred$obs successful).
Area under the curve: 0.9111
> plot(fdaRoc, type = "s", add = TRUE, col = rgb(.2, .2, .2, .2), legacy.axes = TRUE)

Call:
roc.default(response = fdaFit$pred$obs, predictor = fdaFit$pred$successful,     levels = rev(levels(fdaFit$pred$obs)))

Data: fdaFit$pred$successful in 987 controls (fdaFit$pred$obs unsuccessful) < 570 cases (fdaFit$pred$obs successful).
Area under the curve: 0.924
> plot(svmPRoc, type = "s", add = TRUE, legacy.axes = TRUE)

Call:
roc.default(response = svmPFitReduced$pred$obs, predictor = svmPFitReduced$pred$successful,     levels = rev(levels(svmPFitReduced$pred$obs)))

Data: svmPFitReduced$pred$successful in 987 controls (svmPFitReduced$pred$obs unsuccessful) < 570 cases (svmPFitReduced$pred$obs successful).
Area under the curve: 0.8982
> 
> ################################################################################
> ### Section 13.5 K-Nearest Neighbors
> 
> 
> set.seed(476)
> knnFit <- train(x = training[,reducedSet], 
+                 y = training$Class,
+                 method = "knn",
+                 metric = "ROC",
+                 preProc = c("center", "scale"),
+                 tuneGrid = data.frame(k = c(4*(0:5)+1,20*(1:5)+1,50*(2:9)+1)),
+                 trControl = ctrl)
> knnFit
k-Nearest Neighbors 

8190 samples
 252 predictors
   2 classes: 'successful', 'unsuccessful' 

Pre-processing: centered, scaled 
Resampling: Repeated Train/Test Splits Estimated (1 reps, 0.75%) 

Summary of sample sizes: 6633 

Resampling results across tuning parameters:

  k    ROC    Sens   Spec   ROC SD  Sens SD  Spec SD
  1    0.622  0.547  0.694  NA      NA       NA     
  5    0.7    0.553  0.708  NA      NA       NA     
  9    0.706  0.542  0.746  NA      NA       NA     
  13   0.709  0.558  0.743  NA      NA       NA     
  17   0.711  0.565  0.737  NA      NA       NA     
  21   0.724  0.542  0.739  0       0        0.00143
  41   0.734  0.575  0.757  NA      NA       NA     
  61   0.75   0.556  0.785  NA      NA       NA     
  81   0.762  0.535  0.811  NA      NA       NA     
  101  0.766  0.52   0.825  0       0.00124  0      
  151  0.773  0.454  0.866  NA      NA       NA     
  201  0.779  0.395  0.891  NA      NA       NA     
  251  0.781  0.351  0.897  NA      NA       NA     
  301  0.787  0.333  0.907  NA      NA       NA     
  351  0.792  0.312  0.906  NA      NA       NA     
  401  0.797  0.337  0.905  NA      NA       NA     
  451  0.807  0.353  0.908  NA      NA       NA     

ROC was used to select the optimal model using  the largest value.
The final value used for the model was k = 451. 
> 
> knnFit$pred <- merge(knnFit$pred,  knnFit$bestTune)
> knnCM <- confusionMatrix(knnFit, norm = "none")
> knnCM
Repeated Train/Test Splits Estimated (1 reps, 0.75%) Confusion Matrix 

(entries are un-normalized counts)
 
Confusion Matrix and Statistics

              Reference
Prediction     successful unsuccessful
  successful          201           91
  unsuccessful        369          896
                                          
               Accuracy : 0.7046          
                 95% CI : (0.6812, 0.7271)
    No Information Rate : 0.6339          
    P-Value [Acc > NIR] : 2.461e-09       
                                          
                  Kappa : 0.2903          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.3526          
            Specificity : 0.9078          
         Pos Pred Value : 0.6884          
         Neg Pred Value : 0.7083          
             Prevalence : 0.3661          
         Detection Rate : 0.1291          
   Detection Prevalence : 0.1875          
      Balanced Accuracy : 0.6302          
                                          
       'Positive' Class : successful      
                                          

> knnRoc <- roc(response = knnFit$pred$obs,
+               predictor = knnFit$pred$successful,
+               levels = rev(levels(knnFit$pred$obs)))
> 
> update(plot(knnFit), ylab = "ROC (2008 Hold-Out Data)")
> 
> plot(fdaRoc, type = "s", col = rgb(.2, .2, .2, .2), legacy.axes = TRUE)

Call:
roc.default(response = fdaFit$pred$obs, predictor = fdaFit$pred$successful,     levels = rev(levels(fdaFit$pred$obs)))

Data: fdaFit$pred$successful in 987 controls (fdaFit$pred$obs unsuccessful) < 570 cases (fdaFit$pred$obs successful).
Area under the curve: 0.924
> plot(nnetRoc, type = "s", add = TRUE, col = rgb(.2, .2, .2, .2), legacy.axes = TRUE)

Call:
roc.default(response = nnetFit4$pred$obs, predictor = nnetFit4$pred$successful,     levels = rev(levels(nnetFit4$pred$obs)))

Data: nnetFit4$pred$successful in 987 controls (nnetFit4$pred$obs unsuccessful) < 570 cases (nnetFit4$pred$obs successful).
Area under the curve: 0.9111
> plot(svmPRoc, type = "s", add = TRUE, col = rgb(.2, .2, .2, .2), legacy.axes = TRUE)

Call:
roc.default(response = svmPFitReduced$pred$obs, predictor = svmPFitReduced$pred$successful,     levels = rev(levels(svmPFitReduced$pred$obs)))

Data: svmPFitReduced$pred$successful in 987 controls (svmPFitReduced$pred$obs unsuccessful) < 570 cases (svmPFitReduced$pred$obs successful).
Area under the curve: 0.8982
> plot(knnRoc, type = "s", add = TRUE, legacy.axes = TRUE)

Call:
roc.default(response = knnFit$pred$obs, predictor = knnFit$pred$successful,     levels = rev(levels(knnFit$pred$obs)))

Data: knnFit$pred$successful in 987 controls (knnFit$pred$obs unsuccessful) < 570 cases (knnFit$pred$obs successful).
Area under the curve: 0.8068
> 
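The k grid passed to train() above is built from three arithmetic sequences so that every candidate neighborhood size is odd (avoiding tied votes in a two-class problem) and the grid stays sparse at large k, where performance changes slowly. Expanding it by hand:

In [ ]:
%%R

## The three sequences cover small, medium, and large neighborhoods;
## 21 and 101 are generated twice, so duplicates are dropped.
k <- c(4*(0:5) + 1, 20*(1:5) + 1, 50*(2:9) + 1)
sort(unique(k))   # the 17 odd values shown in the knn results table above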
> ################################################################################
> ### Section 13.6 Naive Bayes
> 
> ## Create factor versions of some of the predictors so that they are treated
> ## as categories and not dummy variables
> 
> factors <- c("SponsorCode", "ContractValueBand", "Month", "Weekday")
> nbPredictors <- factorPredictors[factorPredictors %in% reducedSet]
> nbPredictors <- c(nbPredictors, factors)
> nbPredictors <- nbPredictors[nbPredictors != "SponsorUnk"]
> 
> nbTraining <- training[, c("Class", nbPredictors)]
> nbTesting <- testing[, c("Class", nbPredictors)]
> 
> for(i in nbPredictors)
+ {
+   if(length(unique(training[,i])) <= 15)
+   {
+     nbTraining[, i] <- factor(nbTraining[,i], levels = paste(sort(unique(training[,i]))))
+     nbTesting[, i] <- factor(nbTesting[,i], levels = paste(sort(unique(training[,i]))))
+   }
+ }
> 
> set.seed(476)
> nBayesFit <- train(x = nbTraining[,nbPredictors],
+                    y = nbTraining$Class,
+                    method = "nb",
+                    metric = "ROC",
+                    tuneGrid = data.frame(usekernel = c(TRUE, FALSE), fL = 2),
+                    trControl = ctrl)
Loading required package: klaR
Loading required package: MASS
> nBayesFit
Naive Bayes 

8190 samples
 205 predictors
   2 classes: 'successful', 'unsuccessful' 

No pre-processing
Resampling: Repeated Train/Test Splits Estimated (1 reps, 0.75%) 

Summary of sample sizes: 6633 

Resampling results across tuning parameters:

  usekernel  ROC    Sens   Spec 
  FALSE      0.782  0.588  0.796
  TRUE       0.814  0.644  0.824

Tuning parameter 'fL' was held constant at a value of 2
ROC was used to select the optimal model using  the largest value.
The final values used for the model were fL = 2 and usekernel = TRUE. 
> 
> nBayesFit$pred <- merge(nBayesFit$pred,  nBayesFit$bestTune)
> nBayesCM <- confusionMatrix(nBayesFit, norm = "none")
> nBayesCM
Repeated Train/Test Splits Estimated (1 reps, 0.75%) Confusion Matrix 

(entries are un-normalized counts)
 
Confusion Matrix and Statistics

              Reference
Prediction     successful unsuccessful
  successful          367          174
  unsuccessful        203          813
                                         
               Accuracy : 0.7579         
                 95% CI : (0.7358, 0.779)
    No Information Rate : 0.6339         
    P-Value [Acc > NIR] : <2e-16         
                                         
                  Kappa : 0.4726         
 Mcnemar's Test P-Value : 0.1493         
                                         
            Sensitivity : 0.6439         
            Specificity : 0.8237         
         Pos Pred Value : 0.6784         
         Neg Pred Value : 0.8002         
             Prevalence : 0.3661         
         Detection Rate : 0.2357         
   Detection Prevalence : 0.3475         
      Balanced Accuracy : 0.7338         
                                         
       'Positive' Class : successful     
                                         

> nBayesRoc <- roc(response = nBayesFit$pred$obs,
+                  predictor = nBayesFit$pred$successful,
+                  levels = rev(levels(nBayesFit$pred$obs)))
> nBayesRoc

Call:
roc.default(response = nBayesFit$pred$obs, predictor = nBayesFit$pred$successful,     levels = rev(levels(nBayesFit$pred$obs)))

Data: nBayesFit$pred$successful in 987 controls (nBayesFit$pred$obs unsuccessful) < 570 cases (nBayesFit$pred$obs successful).
Area under the curve: 0.8137
> 
> 
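The factor-conversion loop in Section 13.6 matters because klaR's naive Bayes model treats numeric columns as continuous densities (Gaussian, or kernel when usekernel = TRUE) and factor columns as discrete probability tables; leaving 0/1 dummy variables numeric would model them with a density. A minimal sketch of the recoding for one hypothetical binary predictor:

In [ ]:
%%R

## Recode a low-cardinality numeric predictor as a factor so naive Bayes
## builds a discrete conditional table for it (sketch; `x` is made up).
x <- c(0, 1, 1, 0, 1)
factor(x, levels = paste(sort(unique(x))))   # levels "0" and "1", treated as categories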
> sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] klaR_0.6-7          MASS_7.3-26         kernlab_0.9-16     
 [4] earth_3.2-3         plotrix_3.4-6       plotmo_1.3-2       
 [7] leaps_2.9           latticeExtra_0.6-24 RColorBrewer_1.0-5 
[10] nnet_7.3-6          e1071_1.6-1         pROC_1.5.4         
[13] plyr_1.8            mda_0.4-2           class_7.3-7        
[16] doMC_1.3.0          iterators_1.0.6     foreach_1.4.0      
[19] caret_6.0-22        ggplot2_0.9.3.1     lattice_0.20-15    

loaded via a namespace (and not attached):
 [1] car_2.0-16       codetools_0.2-8  colorspace_1.2-1 compiler_3.0.1  
 [5] dichromat_2.0-0  digest_0.6.3     grid_3.0.1       gtable_0.1.2    
 [9] labeling_0.1     munsell_0.4      proto_0.3-10     reshape2_1.2.2  
[13] scales_0.2.3     stringr_0.6.2   
> 
> q("no")
> proc.time()
     user    system   elapsed 
313451.24   2270.67  52861.72 
In [74]:
%%R -w 600 -h 600

##  runChapterScript(13)

##       user    system   elapsed 
##  313451.24   2270.67  52861.72
NULL
In [87]:
%%R

showChapterScript(14)
NULL
In [76]:
%%R

showChapterOutput(14)
R Information
R version 3.0.1 (2013-05-16) -- "Good Sport"
Copyright (C) 2013 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin10.8.0 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> ################################################################################
> ### R code from Applied Predictive Modeling (2013) by Kuhn and Johnson.
> ### Copyright 2013 Kuhn and Johnson
> ### Web Page: http://www.appliedpredictivemodeling.com
> ### Contact: Max Kuhn (mxkuhn@gmail.com) 
> ###
> ### Chapter 14 Classification Trees and Rule Based Models
> ###
> ### Required packages: AppliedPredictiveModeling, C50, caret, doMC (optional),
> ###                    gbm, lattice, partykit, pROC, randomForest, reshape2,
> ###                    rpart, RWeka
> ###
> ### Data used: The grant application data. See the file 'CreateGrantData.R'
> ###
> ### Notes: 
> ### 1) This code is provided without warranty.
> ###
> ### 2) This code should help the user reproduce the results in the
> ### text. There will be differences between this code and what is in
> ### the computing section. For example, the computing sections show
> ### how the source functions work (e.g. randomForest() or plsr()),
> ### which were not directly used when creating the book. Also, there may be 
> ### syntax differences that occur over time as packages evolve. These files 
> ### will reflect those changes.
> ###
> ### 3) In some cases, the calculations in the book were run in 
> ### parallel. The sub-processes may reset the random number seed.
> ### Your results may slightly vary.
> ###
> ################################################################################
> 
> ### NOTE: Many of the models here are computationally expensive. If
> ### this script is run as-is, the memory requirements will accumulate
> ### until they exceed 32 GB. 
> 
> ################################################################################
> ### Section 14.1 Basic Classification Trees
> 
> library(caret)
Loading required package: lattice
Loading required package: ggplot2
> 
> load("grantData.RData")
> 
> ctrl <- trainControl(method = "LGOCV",
+                      summaryFunction = twoClassSummary,
+                      classProbs = TRUE,
+                      index = list(TrainSet = pre2008),
+                      savePredictions = TRUE)
> 
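The trainControl() call above is worth a note: although method = "LGOCV" normally draws random leave-group-out splits, supplying `index` pins the resampling to a single pre-specified split — here the pre-2008 grants for training with the 2008 grants held out — which is why every summary in this output reports "1 reps". A toy version of the same idea with a hypothetical fixed index:

In [ ]:
%%R

## Passing `index` fixes the resampling split instead of drawing it at random
## (sketch; the 1:8 index is made up for illustration).
toyCtrl <- trainControl(method = "LGOCV",
                        summaryFunction = twoClassSummary,
                        classProbs = TRUE,
                        index = list(TrainSet = 1:8),  # rows 1-8 train, the rest hold out
                        savePredictions = TRUE)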
> set.seed(476)
> rpartFit <- train(x = training[,fullSet], 
+                   y = training$Class,
+                   method = "rpart",
+                   tuneLength = 30,
+                   metric = "ROC",
+                   trControl = ctrl)
Loading required package: rpart
Loading required package: pROC
Loading required package: plyr
Type 'citation("pROC")' for a citation.

Attaching package: 'pROC'

The following object is masked from 'package:stats':

    cov, smooth, var

> rpartFit
CART 

8190 samples
1070 predictors
   2 classes: 'successful', 'unsuccessful' 

No pre-processing
Resampling: Repeated Train/Test Splits Estimated (1 reps, 0.75%) 

Summary of sample sizes: 6633 

Resampling results across tuning parameters:

  cp        ROC    Sens   Spec 
  0.000351  0.895  0.779  0.837
  0.000394  0.895  0.779  0.837
  0.000526  0.896  0.804  0.841
  0.000657  0.897  0.823  0.83 
  0.000789  0.897  0.793  0.839
  0.000877  0.897  0.877  0.818
  0.000894  0.897  0.877  0.818
  0.00092   0.897  0.877  0.818
  0.00105   0.898  0.881  0.806
  0.00131   0.906  0.882  0.816
  0.00145   0.91   0.844  0.848
  0.00158   0.911  0.847  0.846
  0.0021    0.912  0.811  0.862
  0.00224   0.912  0.811  0.862
  0.00237   0.912  0.811  0.862
  0.00272   0.912  0.811  0.862
  0.00276   0.912  0.811  0.862
  0.0028    0.912  0.8    0.865
  0.00289   0.912  0.8    0.865
  0.00394   0.883  0.886  0.811
  0.00421   0.875  0.858  0.81 
  0.0046    0.875  0.858  0.81 
  0.00526   0.874  0.858  0.81 
  0.00736   0.884  0.837  0.813
  0.0113    0.884  0.837  0.813
  0.021     0.871  0.947  0.727
  0.0227    0.871  0.947  0.727
  0.0465    0.85   0.944  0.735
  0.0715    0.852  0.944  0.738
  0.387     0.815  0.991  0.638

ROC was used to select the optimal model using  the largest value.
The final value used for the model was cp = 0.00289. 
> 
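partykit's as.party() converts the fitted rpart tree into a unified `party` object whose plot method draws node-level panels rather than rpart's terse text labels. Since the grant data are not distributed with the book, a sketch of the same conversion on a built-in data set:

In [ ]:
%%R

## Convert an rpart tree to a party object for richer plotting
## (sketch on iris, not the grant data).
library(rpart)
library(partykit)
toyFit <- rpart(Species ~ ., data = iris)
plot(as.party(toyFit))   # terminal nodes drawn as class-distribution panels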
> library(partykit)
Loading required package: grid
> plot(as.party(rpartFit$finalModel))
> 
> rpart2008 <- merge(rpartFit$pred,  rpartFit$bestTune)
> rpartCM <- confusionMatrix(rpartFit, norm = "none")
> rpartCM
Repeated Train/Test Splits Estimated (1 reps, 0.75%) Confusion Matrix 

(entries are un-normalized counts)
 
Loading required package: class
Confusion Matrix and Statistics

              Reference
Prediction     successful unsuccessful
  successful          456          133
  unsuccessful        114          854
                                          
               Accuracy : 0.8414          
                 95% CI : (0.8223, 0.8592)
    No Information Rate : 0.6339          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.6606          
 Mcnemar's Test P-Value : 0.2521          
                                          
            Sensitivity : 0.8000          
            Specificity : 0.8652          
         Pos Pred Value : 0.7742          
         Neg Pred Value : 0.8822          
             Prevalence : 0.3661          
         Detection Rate : 0.2929          
   Detection Prevalence : 0.3783          
      Balanced Accuracy : 0.8326          
                                          
       'Positive' Class : successful      
                                          

> rpartRoc <- roc(response = rpartFit$pred$obs,
+                 predictor = rpartFit$pred$successful,
+                 levels = rev(levels(rpartFit$pred$obs)))
> 
> set.seed(476)
> rpartFactorFit <- train(x = training[,factorPredictors], 
+                         y = training$Class,
+                         method = "rpart",
+                         tuneLength = 30,
+                         metric = "ROC",
+                         trControl = ctrl)
> rpartFactorFit 
CART 

8190 samples
1488 predictors
   2 classes: 'successful', 'unsuccessful' 

No pre-processing
Resampling: Repeated Train/Test Splits Estimated (1 reps, 0.75%) 

Summary of sample sizes: 6633 

Resampling results across tuning parameters:

  cp        ROC    Sens   Spec 
  0.000175  0.901  0.735  0.87 
  0.00021   0.901  0.735  0.87 
  0.000263  0.901  0.735  0.87 
  0.000368  0.891  0.761  0.864
  0.000376  0.891  0.761  0.864
  0.000394  0.891  0.761  0.864
  0.000526  0.891  0.775  0.865
  0.000657  0.895  0.795  0.866
  0.000789  0.899  0.821  0.864
  0.000877  0.899  0.821  0.864
  0.00092   0.899  0.821  0.864
  0.00105   0.897  0.825  0.856
  0.00118   0.898  0.825  0.853
  0.00131   0.894  0.837  0.847
  0.00145   0.894  0.837  0.847
  0.00184   0.902  0.825  0.855
  0.00237   0.902  0.825  0.858
  0.0025    0.903  0.821  0.866
  0.00263   0.903  0.821  0.866
  0.00289   0.91   0.812  0.872
  0.00394   0.892  0.847  0.831
  0.00539   0.892  0.847  0.831
  0.0071    0.892  0.847  0.831
  0.00763   0.901  0.847  0.831
  0.0116    0.899  0.828  0.834
  0.0146    0.899  0.828  0.834
  0.0318    0.9    0.823  0.841
  0.0652    0.867  0.865  0.779
  0.153     0.817  0.988  0.645
  0.393     0.817  0.988  0.645

ROC was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.00289. 
> plot(as.party(rpartFactorFit$finalModel))
> 
> rpartFactor2008 <- merge(rpartFactorFit$pred,  rpartFactorFit$bestTune)
> rpartFactorCM <- confusionMatrix(rpartFactorFit, norm = "none")
> rpartFactorCM
Repeated Train/Test Splits Estimated (1 reps, 0.75%) Confusion Matrix 

(entries are un-normalized counts)
 
Confusion Matrix and Statistics

              Reference
Prediction     successful unsuccessful
  successful          463          126
  unsuccessful        107          861
                                          
               Accuracy : 0.8504          
                 95% CI : (0.8317, 0.8677)
    No Information Rate : 0.6339          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.6798          
 Mcnemar's Test P-Value : 0.2383          
                                          
            Sensitivity : 0.8123          
            Specificity : 0.8723          
         Pos Pred Value : 0.7861          
         Neg Pred Value : 0.8895          
             Prevalence : 0.3661          
         Detection Rate : 0.2974          
   Detection Prevalence : 0.3783          
      Balanced Accuracy : 0.8423          
                                          
       'Positive' Class : successful      
                                          

> 
> rpartFactorRoc <- roc(response = rpartFactorFit$pred$obs,
+                       predictor = rpartFactorFit$pred$successful,
+                       levels = rev(levels(rpartFactorFit$pred$obs)))
> 
> plot(rpartRoc, type = "s", print.thres = c(.5),
+      print.thres.pch = 3,
+      print.thres.pattern = "",
+      print.thres.cex = 1.2,
+      col = "red", legacy.axes = TRUE,
+      print.thres.col = "red")

Call:
roc.default(response = rpartFit$pred$obs, predictor = rpartFit$pred$successful,     levels = rev(levels(rpartFit$pred$obs)))

Data: rpartFit$pred$successful in 29610 controls (rpartFit$pred$obs unsuccessful) < 17100 cases (rpartFit$pred$obs successful).
Area under the curve: 0.8915
> plot(rpartFactorRoc,
+      type = "s",
+      add = TRUE,
+      print.thres = c(.5),
+      print.thres.pch = 16, legacy.axes = TRUE,
+      print.thres.pattern = "",
+      print.thres.cex = 1.2)

Call:
roc.default(response = rpartFactorFit$pred$obs, predictor = rpartFactorFit$pred$successful,     levels = rev(levels(rpartFactorFit$pred$obs)))

Data: rpartFactorFit$pred$successful in 29610 controls (rpartFactorFit$pred$obs unsuccessful) < 17100 cases (rpartFactorFit$pred$obs successful).
Area under the curve: 0.8856
> legend(.75, .2,
+        c("Grouped Categories", "Independent Categories"),
+        lwd = c(1, 1),
+        col = c("black", "red"),
+        pch = c(16, 3))
> 
> set.seed(476)
> j48FactorFit <- train(x = training[,factorPredictors], 
+                       y = training$Class,
+                       method = "J48",
+                       metric = "ROC",
+                       trControl = ctrl)
Loading required package: RWeka
> j48FactorFit
C4.5-like Trees 

8190 samples
1488 predictors
   2 classes: 'successful', 'unsuccessful' 

No pre-processing
Resampling: Repeated Train/Test Splits Estimated (1 reps, 0.75%) 

Summary of sample sizes: 6633 

Resampling results

  ROC    Sens   Spec 
  0.835  0.839  0.817

Tuning parameter 'C' was held constant at a value of 0.25
 
> 
> j48Factor2008 <- merge(j48FactorFit$pred,  j48FactorFit$bestTune)
> j48FactorCM <- confusionMatrix(j48FactorFit, norm = "none")
> j48FactorCM
Repeated Train/Test Splits Estimated (1 reps, 0.75%) Confusion Matrix 

(entries are un-normalized counts)
 
Confusion Matrix and Statistics

              Reference
Prediction     successful unsuccessful
  successful          478          181
  unsuccessful         92          806
                                          
               Accuracy : 0.8247          
                 95% CI : (0.8048, 0.8432)
    No Information Rate : 0.6339          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.6343          
 Mcnemar's Test P-Value : 1.004e-07       
                                          
            Sensitivity : 0.8386          
            Specificity : 0.8166          
         Pos Pred Value : 0.7253          
         Neg Pred Value : 0.8976          
             Prevalence : 0.3661          
         Detection Rate : 0.3070          
   Detection Prevalence : 0.4232          
      Balanced Accuracy : 0.8276          
                                          
       'Positive' Class : successful      
                                          

> 
> j48FactorRoc <- roc(response = j48FactorFit$pred$obs,
+                     predictor = j48FactorFit$pred$successful,
+                     levels = rev(levels(j48FactorFit$pred$obs)))
> 
> set.seed(476)
> j48Fit <- train(x = training[,fullSet], 
+                 y = training$Class,
+                 method = "J48",
+                 metric = "ROC",
+                 trControl = ctrl)
> 
> j482008 <- merge(j48Fit$pred,  j48Fit$bestTune)
> j48CM <- confusionMatrix(j48Fit, norm = "none")
> j48CM
Repeated Train/Test Splits Estimated (1 reps, 0.75%) Confusion Matrix 

(entries are un-normalized counts)
 
Confusion Matrix and Statistics

              Reference
Prediction     successful unsuccessful
  successful          438          160
  unsuccessful        132          827
                                          
               Accuracy : 0.8125          
                 95% CI : (0.7922, 0.8316)
    No Information Rate : 0.6339          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.6001          
 Mcnemar's Test P-Value : 0.1141          
                                          
            Sensitivity : 0.7684          
            Specificity : 0.8379          
         Pos Pred Value : 0.7324          
         Neg Pred Value : 0.8624          
             Prevalence : 0.3661          
         Detection Rate : 0.2813          
   Detection Prevalence : 0.3841          
      Balanced Accuracy : 0.8032          
                                          
       'Positive' Class : successful      
                                          

> 
> j48Roc <- roc(response = j48Fit$pred$obs,
+               predictor = j48Fit$pred$successful,
+               levels = rev(levels(j48Fit$pred$obs)))
> 
> 
> plot(j48FactorRoc, type = "s", print.thres = c(.5), 
+      print.thres.pch = 16, print.thres.pattern = "", 
+      print.thres.cex = 1.2, legacy.axes = TRUE)

Call:
roc.default(response = j48FactorFit$pred$obs, predictor = j48FactorFit$pred$successful,     levels = rev(levels(j48FactorFit$pred$obs)))

Data: j48FactorFit$pred$successful in 987 controls (j48FactorFit$pred$obs unsuccessful) < 570 cases (j48FactorFit$pred$obs successful).
Area under the curve: 0.8353
> plot(j48Roc, type = "s", print.thres = c(.5), 
+      print.thres.pch = 3, print.thres.pattern = "", 
+      print.thres.cex = 1.2, legacy.axes = TRUE,
+      add = TRUE, col = "red", print.thres.col = "red")

Call:
roc.default(response = j48Fit$pred$obs, predictor = j48Fit$pred$successful,     levels = rev(levels(j48Fit$pred$obs)))

Data: j48Fit$pred$successful in 987 controls (j48Fit$pred$obs unsuccessful) < 570 cases (j48Fit$pred$obs successful).
Area under the curve: 0.842
> legend(.75, .2,
+        c("Grouped Categories", "Independent Categories"),
+        lwd = c(1, 1),
+        col = c("black", "red"),
+        pch = c(16, 3))
> 
> plot(rpartFactorRoc, type = "s", add = TRUE, 
+      col = rgb(.2, .2, .2, .2), legacy.axes = TRUE)

Call:
roc.default(response = rpartFactorFit$pred$obs, predictor = rpartFactorFit$pred$successful,     levels = rev(levels(rpartFactorFit$pred$obs)))

Data: rpartFactorFit$pred$successful in 29610 controls (rpartFactorFit$pred$obs unsuccessful) < 17100 cases (rpartFactorFit$pred$obs successful).
Area under the curve: 0.8856
> 
> ################################################################################
> ### Section 14.2 Rule-Based Models
> 
> set.seed(476)
> partFit <- train(x = training[,fullSet], 
+                  y = training$Class,
+                  method = "PART",
+                  metric = "ROC",
+                  trControl = ctrl)
> partFit
Rule-Based Classifier 

8190 samples
1070 predictors
   2 classes: 'successful', 'unsuccessful' 

No pre-processing
Resampling: Repeated Train/Test Splits Estimated (1 reps, 0.75%) 

Summary of sample sizes: 6633 

Resampling results

  ROC    Sens   Spec 
  0.809  0.779  0.802

Tuning parameter 'threshold' was held constant at a value of 0.25

Tuning parameter 'pruned' was held constant at a value of yes
 
> 
> part2008 <- merge(partFit$pred,  partFit$bestTune)
> partCM <- confusionMatrix(partFit, norm = "none")
> partCM
Repeated Train/Test Splits Estimated (1 reps, 0.75%) Confusion Matrix 

(entries are un-normalized counts)
 
Confusion Matrix and Statistics

              Reference
Prediction     successful unsuccessful
  successful          444          195
  unsuccessful        126          792
                                          
               Accuracy : 0.7938          
                 95% CI : (0.7729, 0.8137)
    No Information Rate : 0.6339          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.5669          
 Mcnemar's Test P-Value : 0.0001474       
                                          
            Sensitivity : 0.7789          
            Specificity : 0.8024          
         Pos Pred Value : 0.6948          
         Neg Pred Value : 0.8627          
             Prevalence : 0.3661          
         Detection Rate : 0.2852          
   Detection Prevalence : 0.4104          
      Balanced Accuracy : 0.7907          
                                          
       'Positive' Class : successful      
                                          

> 
> partRoc <- roc(response = partFit$pred$obs,
+                predictor = partFit$pred$successful,
+                levels = rev(levels(partFit$pred$obs)))
> partRoc

Call:
roc.default(response = partFit$pred$obs, predictor = partFit$pred$successful,     levels = rev(levels(partFit$pred$obs)))

Data: partFit$pred$successful in 987 controls (partFit$pred$obs unsuccessful) < 570 cases (partFit$pred$obs successful).
Area under the curve: 0.809
> 
> set.seed(476)
> partFactorFit <- train(training[,factorPredictors], training$Class,
+                        method = "PART",
+                        metric = "ROC",
+                        trControl = ctrl)
> partFactorFit
Rule-Based Classifier 

8190 samples
1488 predictors
   2 classes: 'successful', 'unsuccessful' 

No pre-processing
Resampling: Repeated Train/Test Splits Estimated (1 reps, 0.75%) 

Summary of sample sizes: 6633 

Resampling results

  ROC    Sens   Spec 
  0.835  0.807  0.766

Tuning parameter 'threshold' was held constant at a value of 0.25

Tuning parameter 'pruned' was held constant at a value of yes
 
> 
> partFactor2008 <- merge(partFactorFit$pred,  partFactorFit$bestTune)
> partFactorCM <- confusionMatrix(partFactorFit, norm = "none")
> partFactorCM
Repeated Train/Test Splits Estimated (1 reps, 0.75%) Confusion Matrix 

(entries are un-normalized counts)
 
Confusion Matrix and Statistics

              Reference
Prediction     successful unsuccessful
  successful          460          231
  unsuccessful        110          756
                                          
               Accuracy : 0.781           
                 95% CI : (0.7596, 0.8013)
    No Information Rate : 0.6339          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.5484          
 Mcnemar's Test P-Value : 8.12e-11        
                                          
            Sensitivity : 0.8070          
            Specificity : 0.7660          
         Pos Pred Value : 0.6657          
         Neg Pred Value : 0.8730          
             Prevalence : 0.3661          
         Detection Rate : 0.2954          
   Detection Prevalence : 0.4438          
      Balanced Accuracy : 0.7865          
                                          
       'Positive' Class : successful      
                                          

> 
> partFactorRoc <- roc(response = partFactorFit$pred$obs,
+                      predictor = partFactorFit$pred$successful,
+                      levels = rev(levels(partFactorFit$pred$obs)))
> partFactorRoc

Call:
roc.default(response = partFactorFit$pred$obs, predictor = partFactorFit$pred$successful,     levels = rev(levels(partFactorFit$pred$obs)))

Data: partFactorFit$pred$successful in 987 controls (partFactorFit$pred$obs unsuccessful) < 570 cases (partFactorFit$pred$obs successful).
Area under the curve: 0.8347
> 
> ################################################################################
> ### Section 14.3 Bagged Trees
> 
> set.seed(476)
> treebagFit <- train(x = training[,fullSet], 
+                     y = training$Class,
+                     method = "treebag",
+                     nbagg = 50,
+                     metric = "ROC",
+                     trControl = ctrl)
Loading required package: ipred
Loading required package: MASS
Loading required package: survival
Loading required package: splines

Attaching package: 'survival'

The following object is masked from 'package:caret':

    cluster

Loading required package: nnet
Loading required package: prodlim
KernSmooth 2.23 loaded
Copyright M. P. Wand 1997-2009
> treebagFit
Bagged CART 

8190 samples
1070 predictors
   2 classes: 'successful', 'unsuccessful' 

No pre-processing
Resampling: Repeated Train/Test Splits Estimated (1 reps, 0.75%) 

Summary of sample sizes: 6633 

Resampling results

  ROC    Sens  Spec 
  0.921  0.83  0.857

 
> 
> treebag2008 <- merge(treebagFit$pred,  treebagFit$bestTune)
> treebagCM <- confusionMatrix(treebagFit, norm = "none")
> treebagCM
Repeated Train/Test Splits Estimated (1 reps, 0.75%) Confusion Matrix 

(entries are un-normalized counts)
 
Confusion Matrix and Statistics

              Reference
Prediction     successful unsuccessful
  successful          473          141
  unsuccessful         97          846
                                          
               Accuracy : 0.8471          
                 95% CI : (0.8283, 0.8647)
    No Information Rate : 0.6339          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.6759          
 Mcnemar's Test P-Value : 0.005315        
                                          
            Sensitivity : 0.8298          
            Specificity : 0.8571          
         Pos Pred Value : 0.7704          
         Neg Pred Value : 0.8971          
             Prevalence : 0.3661          
         Detection Rate : 0.3038          
   Detection Prevalence : 0.3943          
      Balanced Accuracy : 0.8435          
                                          
       'Positive' Class : successful      
                                          

> 
> treebagRoc <- roc(response = treebagFit$pred$obs,
+                   predictor = treebagFit$pred$successful,
+                   levels = rev(levels(treebagFit$pred$obs)))
> set.seed(476)
> treebagFactorFit <- train(x = training[,factorPredictors], 
+                           y = training$Class,
+                           method = "treebag",
+                           nbagg = 50,
+                           metric = "ROC",
+                           trControl = ctrl)
> treebagFactorFit
Bagged CART 

8190 samples
1488 predictors
   2 classes: 'successful', 'unsuccessful' 

No pre-processing
Resampling: Repeated Train/Test Splits Estimated (1 reps, 0.75%) 

Summary of sample sizes: 6633 

Resampling results

  ROC    Sens   Spec 
  0.917  0.835  0.861

 
> 
> treebagFactor2008 <- merge(treebagFactorFit$pred,  treebagFactorFit$bestTune)
> treebagFactorCM <- confusionMatrix(treebagFactorFit, norm = "none")
> treebagFactorCM
Repeated Train/Test Splits Estimated (1 reps, 0.75%) Confusion Matrix 

(entries are un-normalized counts)
 
Confusion Matrix and Statistics

              Reference
Prediction     successful unsuccessful
  successful          476          137
  unsuccessful         94          850
                                         
               Accuracy : 0.8516         
                 95% CI : (0.833, 0.8689)
    No Information Rate : 0.6339         
    P-Value [Acc > NIR] : < 2e-16        
                                         
                  Kappa : 0.6854         
 Mcnemar's Test P-Value : 0.00572        
                                         
            Sensitivity : 0.8351         
            Specificity : 0.8612         
         Pos Pred Value : 0.7765         
         Neg Pred Value : 0.9004         
             Prevalence : 0.3661         
         Detection Rate : 0.3057         
   Detection Prevalence : 0.3937         
      Balanced Accuracy : 0.8481         
                                         
       'Positive' Class : successful     
                                         

> treebagFactorRoc <- roc(response = treebagFactorFit$pred$obs,
+                         predictor = treebagFactorFit$pred$successful,
+                         levels = rev(levels(treebagFactorFit$pred$obs)))
> 
> 
> plot(rpartRoc, type = "s", col = rgb(.2, .2, .2, .2), legacy.axes = TRUE)

Call:
roc.default(response = rpartFit$pred$obs, predictor = rpartFit$pred$successful,     levels = rev(levels(rpartFit$pred$obs)))

Data: rpartFit$pred$successful in 29610 controls (rpartFit$pred$obs unsuccessful) < 17100 cases (rpartFit$pred$obs successful).
Area under the curve: 0.8915
> plot(j48FactorRoc, type = "s", add = TRUE, col = rgb(.2, .2, .2, .2), 
+      legacy.axes = TRUE)

Call:
roc.default(response = j48FactorFit$pred$obs, predictor = j48FactorFit$pred$successful,     levels = rev(levels(j48FactorFit$pred$obs)))

Data: j48FactorFit$pred$successful in 987 controls (j48FactorFit$pred$obs unsuccessful) < 570 cases (j48FactorFit$pred$obs successful).
Area under the curve: 0.8353
> plot(treebagRoc, type = "s", add = TRUE, print.thres = c(.5), 
+      print.thres.pch = 3, legacy.axes = TRUE, print.thres.pattern = "", 
+      print.thres.cex = 1.2,
+      col = "red", print.thres.col = "red")

Call:
roc.default(response = treebagFit$pred$obs, predictor = treebagFit$pred$successful,     levels = rev(levels(treebagFit$pred$obs)))

Data: treebagFit$pred$successful in 987 controls (treebagFit$pred$obs unsuccessful) < 570 cases (treebagFit$pred$obs successful).
Area under the curve: 0.9205
> plot(treebagFactorRoc, type = "s", add = TRUE, print.thres = c(.5), 
+      print.thres.pch = 16, print.thres.pattern = "", legacy.axes = TRUE, 
+      print.thres.cex = 1.2)

Call:
roc.default(response = treebagFactorFit$pred$obs, predictor = treebagFactorFit$pred$successful,     levels = rev(levels(treebagFactorFit$pred$obs)))

Data: treebagFactorFit$pred$successful in 987 controls (treebagFactorFit$pred$obs unsuccessful) < 570 cases (treebagFactorFit$pred$obs successful).
Area under the curve: 0.9173
> legend(.75, .2,
+        c("Grouped Categories", "Independent Categories"),
+        lwd = c(1, 1),
+        col = c("black", "red"),
+        pch = c(16, 3))
> 
> ################################################################################
> ### Section 14.4 Random Forests
> 
> ### For the book, this model was run with only 500 trees (by
> ### accident). More than 1000 trees are usually required to get
> ### consistent results.
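A minimal sketch (not from the book's script) of how one might check whether a given `ntree` is large enough: grow a forest with a generous number of trees and watch the cumulative out-of-bag error, which `randomForest` stores one row per tree in `err.rate`. If the curve is still drifting at 500 trees, grow more. Illustrated here on the built-in `iris` data rather than the grant data.

```r
library(randomForest)

set.seed(476)
rfCheck <- randomForest(x = iris[, 1:4], y = iris$Species, ntree = 1500)

## err.rate has one row per tree; the "OOB" column is the running
## out-of-bag error rate after that many trees have been grown.
plot(rfCheck$err.rate[, "OOB"], type = "l",
     xlab = "Number of trees", ylab = "OOB error rate")
```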
> 
> mtryValues <- c(5, 10, 20, 32, 50, 100, 250, 500, 1000)
> set.seed(476)
> rfFit <- train(x = training[,fullSet], 
+                y = training$Class,
+                method = "rf",
+                ntree = 500,
+                tuneGrid = data.frame(mtry = mtryValues),
+                importance = TRUE,
+                metric = "ROC",
+                trControl = ctrl)
Loading required package: randomForest
randomForest 4.6-7
Type rfNews() to see new features/changes/bug fixes.
> rfFit
Random Forest 

8190 samples
1070 predictors
   2 classes: 'successful', 'unsuccessful' 

No pre-processing
Resampling: Repeated Train/Test Splits Estimated (1 reps, 0.75%) 

Summary of sample sizes: 6633 

Resampling results across tuning parameters:

  mtry  ROC    Sens   Spec 
  5     0.876  0.805  0.769
  10    0.901  0.828  0.812
  20    0.924  0.861  0.827
  32    0.931  0.879  0.835
  50    0.936  0.877  0.835
  100   0.939  0.867  0.846
  250   0.937  0.856  0.858
  500   0.93   0.844  0.862
  1000  0.923  0.837  0.853

ROC was used to select the optimal model using the largest value.
The final value used for the model was mtry = 100. 
> 
> rf2008 <- merge(rfFit$pred,  rfFit$bestTune)
> rfCM <- confusionMatrix(rfFit, norm = "none")
> rfCM
Repeated Train/Test Splits Estimated (1 reps, 0.75%) Confusion Matrix 

(entries are un-normalized counts)
 
Confusion Matrix and Statistics

              Reference
Prediction     successful unsuccessful
  successful          494          152
  unsuccessful         76          835
                                         
               Accuracy : 0.8536         
                 95% CI : (0.835, 0.8708)
    No Information Rate : 0.6339         
    P-Value [Acc > NIR] : < 2e-16        
                                         
                  Kappa : 0.6931         
 Mcnemar's Test P-Value : 6.8e-07        
                                         
            Sensitivity : 0.8667         
            Specificity : 0.8460         
         Pos Pred Value : 0.7647         
         Neg Pred Value : 0.9166         
             Prevalence : 0.3661         
         Detection Rate : 0.3173         
   Detection Prevalence : 0.4149         
      Balanced Accuracy : 0.8563         
                                         
       'Positive' Class : successful     
                                         

> 
> rfRoc <- roc(response = rfFit$pred$obs,
+              predictor = rfFit$pred$successful,
+              levels = rev(levels(rfFit$pred$obs)))
> 
> gc()
             used    (Mb) gc trigger    (Mb)   max used    (Mb)
Ncells    8050579   430.0   13156139   702.7   13156139   702.7
Vcells 4127289672 31488.8 6062765953 46255.3 5498501682 41950.3
> 
> ## The randomForest package cannot handle factors with more than 32
> ## levels, so we make a new set of predictors where the sponsor code
> ## factor is entered as dummy variables instead of a single factor. 
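A minimal sketch (not the script's own code, which reuses pre-built `Sponsor*` columns) of expanding a high-cardinality factor into 0/1 dummy columns with caret's `dummyVars`, which is one way around randomForest's 32-level limit. A small made-up factor stands in for the real `SponsorCode`.

```r
library(caret)

## Toy stand-in for a factor such as SponsorCode
df <- data.frame(SponsorCode = factor(c("A", "B", "C", "A")))

## dummyVars builds the encoding; predict() applies it, producing one
## 0/1 column per factor level.
sponsorDummies <- dummyVars(~ SponsorCode, data = df)
sponsorCols <- predict(sponsorDummies, newdata = df)
colnames(sponsorCols)
```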
> 
> sponsorVars <- grep("Sponsor", names(training), value = TRUE)
> sponsorVars <- sponsorVars[sponsorVars != "SponsorCode"]
> 
> rfPredictors <- factorPredictors
> rfPredictors <- rfPredictors[rfPredictors != "SponsorCode"]
> rfPredictors <- c(rfPredictors, sponsorVars)
> 
> set.seed(476)
> rfFactorFit <- train(x = training[,rfPredictors], 
+                      y = training$Class,
+                      method = "rf",
+                      ntree = 1500,
+                      tuneGrid = data.frame(mtry = mtryValues),
+                      importance = TRUE,
+                      metric = "ROC",
+                      trControl = ctrl)
> rfFactorFit
Random Forest 

8190 samples
1733 predictors
   2 classes: 'successful', 'unsuccessful' 

No pre-processing
Resampling: Repeated Train/Test Splits Estimated (1 reps, 0.75%) 

Summary of sample sizes: 6633 

Resampling results across tuning parameters:

  mtry  ROC    Sens   Spec 
  5     0.808  0.619  0.817
  10    0.855  0.726  0.815
  20    0.891  0.754  0.84 
  32    0.911  0.774  0.855
  50    0.921  0.802  0.865
  100   0.93   0.823  0.87 
  250   0.937  0.842  0.871
  500   0.936  0.847  0.876
  1000  0.931  0.837  0.872

ROC was used to select the optimal model using the largest value.
The final value used for the model was mtry = 250. 
> 
> rfFactor2008 <- merge(rfFactorFit$pred,  rfFactorFit$bestTune)
> rfFactorCM <- confusionMatrix(rfFactorFit, norm = "none")
> rfFactorCM
Repeated Train/Test Splits Estimated (1 reps, 0.75%) Confusion Matrix 

(entries are un-normalized counts)
 
Confusion Matrix and Statistics

              Reference
Prediction     successful unsuccessful
  successful          480          127
  unsuccessful         90          860
                                          
               Accuracy : 0.8606          
                 95% CI : (0.8424, 0.8775)
    No Information Rate : 0.6339          
    P-Value [Acc > NIR] : < 2e-16         
                                          
                  Kappa : 0.7038          
 Mcnemar's Test P-Value : 0.01453         
                                          
            Sensitivity : 0.8421          
            Specificity : 0.8713          
         Pos Pred Value : 0.7908          
         Neg Pred Value : 0.9053          
             Prevalence : 0.3661          
         Detection Rate : 0.3083          
   Detection Prevalence : 0.3899          
      Balanced Accuracy : 0.8567          
                                          
       'Positive' Class : successful      
                                          

> 
> rfFactorRoc <- roc(response = rfFactorFit$pred$obs,
+                    predictor = rfFactorFit$pred$successful,
+                    levels = rev(levels(rfFactorFit$pred$obs)))
> 
> plot(treebagRoc, type = "s", col = rgb(.2, .2, .2, .2), legacy.axes = TRUE)

Call:
roc.default(response = treebagFit$pred$obs, predictor = treebagFit$pred$successful,     levels = rev(levels(treebagFit$pred$obs)))

Data: treebagFit$pred$successful in 987 controls (treebagFit$pred$obs unsuccessful) < 570 cases (treebagFit$pred$obs successful).
Area under the curve: 0.9205
> plot(rpartRoc, type = "s", add = TRUE, col = rgb(.2, .2, .2, .2), legacy.axes = TRUE)

Call:
roc.default(response = rpartFit$pred$obs, predictor = rpartFit$pred$successful,     levels = rev(levels(rpartFit$pred$obs)))

Data: rpartFit$pred$successful in 29610 controls (rpartFit$pred$obs unsuccessful) < 17100 cases (rpartFit$pred$obs successful).
Area under the curve: 0.8915
> plot(j48FactorRoc, type = "s", add = TRUE, col = rgb(.2, .2, .2, .2), 
+      legacy.axes = TRUE)

Call:
roc.default(response = j48FactorFit$pred$obs, predictor = j48FactorFit$pred$successful,     levels = rev(levels(j48FactorFit$pred$obs)))

Data: j48FactorFit$pred$successful in 987 controls (j48FactorFit$pred$obs unsuccessful) < 570 cases (j48FactorFit$pred$obs successful).
Area under the curve: 0.8353
> plot(rfRoc, type = "s", add = TRUE, print.thres = c(.5), 
+      print.thres.pch = 3, legacy.axes = TRUE, print.thres.pattern = "", 
+      print.thres.cex = 1.2,
+      col = "red", print.thres.col = "red")

Call:
roc.default(response = rfFit$pred$obs, predictor = rfFit$pred$successful,     levels = rev(levels(rfFit$pred$obs)))

Data: rfFit$pred$successful in 8883 controls (rfFit$pred$obs unsuccessful) < 5130 cases (rfFit$pred$obs successful).
Area under the curve: 0.9179
> plot(rfFactorRoc, type = "s", add = TRUE, print.thres = c(.5), 
+      print.thres.pch = 16, print.thres.pattern = "", legacy.axes = TRUE, 
+      print.thres.cex = 1.2)

Call:
roc.default(response = rfFactorFit$pred$obs, predictor = rfFactorFit$pred$successful,     levels = rev(levels(rfFactorFit$pred$obs)))

Data: rfFactorFit$pred$successful in 8883 controls (rfFactorFit$pred$obs unsuccessful) < 5130 cases (rfFactorFit$pred$obs successful).
Area under the curve: 0.9049
> legend(.75, .2,
+        c("Grouped Categories", "Independent Categories"),
+        lwd = c(1, 1),
+        col = c("black", "red"),
+        pch = c(16, 3))
> 
> 
> ################################################################################
> ### Section 14.5 Boosting
> 
> gbmGrid <- expand.grid(interaction.depth = c(1, 3, 5, 7, 9),
+                        n.trees = (1:20)*100,
+                        shrinkage = c(.01, .1))
> 
> set.seed(476)
> gbmFit <- train(x = training[,fullSet], 
+                 y = training$Class,
+                 method = "gbm",
+                 tuneGrid = gbmGrid,
+                 metric = "ROC",
+                 verbose = FALSE,
+                 trControl = ctrl)
Loading required package: gbm
Loading required package: parallel
Loaded gbm 2.1
> gbmFit
Stochastic Gradient Boosting 

8190 samples
1070 predictors
   2 classes: 'successful', 'unsuccessful' 

No pre-processing
Resampling: Repeated Train/Test Splits Estimated (1 reps, 0.75%) 

Summary of sample sizes: 6633 

Resampling results across tuning parameters:

  shrinkage  interaction.depth  n.trees  ROC    Sens   Spec 
  0.01       1                  100      0.879  0.947  0.73 
  0.01       1                  200      0.887  0.947  0.73 
  0.01       1                  300      0.888  0.951  0.73 
  0.01       1                  400      0.911  0.951  0.73 
  0.01       1                  500      0.906  0.951  0.73 
  0.01       1                  600      0.908  0.904  0.8  
  0.01       1                  700      0.907  0.905  0.799
  0.01       1                  800      0.91   0.905  0.798
  0.01       1                  900      0.911  0.904  0.799
  0.01       1                  1000     0.914  0.904  0.799
  0.01       1                  1100     0.914  0.9    0.807
  0.01       1                  1200     0.915  0.898  0.814
  0.01       1                  1300     0.916  0.893  0.816
  0.01       1                  1400     0.917  0.889  0.821
  0.01       1                  1500     0.918  0.875  0.822
  0.01       1                  1600     0.919  0.877  0.826
  0.01       1                  1700     0.919  0.865  0.831
  0.01       1                  1800     0.919  0.839  0.842
  0.01       1                  1900     0.92   0.858  0.841
  0.01       1                  2000     0.92   0.835  0.847
  0.01       3                  100      0.913  0.947  0.729
  0.01       3                  200      0.918  0.889  0.81 
  0.01       3                  300      0.919  0.889  0.809
  0.01       3                  400      0.922  0.889  0.809
  0.01       3                  500      0.924  0.889  0.818
  0.01       3                  600      0.926  0.882  0.829
  0.01       3                  700      0.926  0.868  0.84 
  0.01       3                  800      0.928  0.875  0.845
  0.01       3                  900      0.928  0.868  0.847
  0.01       3                  1000     0.929  0.867  0.851
  0.01       3                  1100     0.93   0.865  0.854
  0.01       3                  1200     0.931  0.863  0.86 
  0.01       3                  1300     0.931  0.86   0.863
  0.01       3                  1400     0.932  0.863  0.863
  0.01       3                  1500     0.932  0.86   0.865
  0.01       3                  1600     0.932  0.856  0.869
  0.01       3                  1700     0.932  0.853  0.865
  0.01       3                  1800     0.933  0.853  0.864
  0.01       3                  1900     0.933  0.851  0.865
  0.01       3                  2000     0.933  0.851  0.867
  0.01       5                  100      0.914  0.947  0.726
  0.01       5                  200      0.917  0.895  0.802
  0.01       5                  300      0.925  0.904  0.804
  0.01       5                  400      0.928  0.895  0.83 
  0.01       5                  500      0.93   0.872  0.839
  0.01       5                  600      0.932  0.87   0.845
  0.01       5                  700      0.932  0.87   0.851
  0.01       5                  800      0.934  0.867  0.855
  0.01       5                  900      0.934  0.865  0.854
  0.01       5                  1000     0.935  0.86   0.861
  0.01       5                  1100     0.935  0.861  0.86 
  0.01       5                  1200     0.935  0.861  0.861
  0.01       5                  1300     0.935  0.86   0.865
  0.01       5                  1400     0.935  0.854  0.865
  0.01       5                  1500     0.935  0.856  0.868
  0.01       5                  1600     0.935  0.854  0.868
  0.01       5                  1700     0.935  0.849  0.872
  0.01       5                  1800     0.935  0.844  0.873
  0.01       5                  1900     0.934  0.846  0.873
  0.01       5                  2000     0.935  0.837  0.875
  0.01       7                  100      0.913  0.893  0.798
  0.01       7                  200      0.92   0.911  0.802
  0.01       7                  300      0.926  0.898  0.828
  0.01       7                  400      0.931  0.87   0.842
  0.01       7                  500      0.932  0.867  0.849
  0.01       7                  600      0.933  0.865  0.854
  0.01       7                  700      0.934  0.863  0.858
  0.01       7                  800      0.934  0.858  0.861
  0.01       7                  900      0.935  0.853  0.863
  0.01       7                  1000     0.935  0.849  0.865
  0.01       7                  1100     0.935  0.847  0.864
  0.01       7                  1200     0.935  0.84   0.867
  0.01       7                  1300     0.935  0.839  0.872
  0.01       7                  1400     0.935  0.837  0.875
  0.01       7                  1500     0.935  0.83   0.874
  0.01       7                  1600     0.935  0.83   0.875
  0.01       7                  1700     0.935  0.832  0.878
  0.01       7                  1800     0.935  0.826  0.878
  0.01       7                  1900     0.935  0.819  0.876
  0.01       7                  2000     0.935  0.825  0.876
  0.01       9                  100      0.919  0.895  0.796
  0.01       9                  200      0.927  0.902  0.818
  0.01       9                  300      0.93   0.872  0.844
  0.01       9                  400      0.933  0.863  0.854
  0.01       9                  500      0.935  0.86   0.859
  0.01       9                  600      0.935  0.863  0.861
  0.01       9                  700      0.936  0.858  0.865
  0.01       9                  800      0.936  0.851  0.866
  0.01       9                  900      0.936  0.846  0.87 
  0.01       9                  1000     0.936  0.849  0.869
  0.01       9                  1100     0.936  0.846  0.87 
  0.01       9                  1200     0.936  0.846  0.873
  0.01       9                  1300     0.936  0.842  0.875
  0.01       9                  1400     0.936  0.842  0.876
  0.01       9                  1500     0.936  0.837  0.878
  0.01       9                  1600     0.935  0.84   0.879
  0.01       9                  1700     0.935  0.835  0.877
  0.01       9                  1800     0.935  0.837  0.879
  0.01       9                  1900     0.935  0.832  0.878
  0.01       9                  2000     0.935  0.823  0.877
  0.1        1                  100      0.914  0.889  0.813
  0.1        1                  200      0.92   0.805  0.864
  0.1        1                  300      0.921  0.828  0.859
  0.1        1                  400      0.923  0.821  0.86 
  0.1        1                  500      0.922  0.816  0.865
  0.1        1                  600      0.923  0.809  0.869
  0.1        1                  700      0.922  0.819  0.87 
  0.1        1                  800      0.922  0.818  0.869
  0.1        1                  900      0.922  0.819  0.871
  0.1        1                  1000     0.921  0.823  0.869
  0.1        1                  1100     0.92   0.816  0.868
  0.1        1                  1200     0.918  0.814  0.869
  0.1        1                  1300     0.917  0.816  0.867
  0.1        1                  1400     0.918  0.811  0.866
  0.1        1                  1500     0.916  0.807  0.868
  0.1        1                  1600     0.915  0.807  0.867
  0.1        1                  1700     0.916  0.804  0.871
  0.1        1                  1800     0.914  0.807  0.869
  0.1        1                  1900     0.913  0.802  0.866
  0.1        1                  2000     0.913  0.802  0.865
  0.1        3                  100      0.925  0.856  0.847
  0.1        3                  200      0.932  0.839  0.871
  0.1        3                  300      0.933  0.835  0.874
  0.1        3                  400      0.932  0.83   0.877
  0.1        3                  500      0.93   0.821  0.88 
  0.1        3                  600      0.928  0.826  0.868
  0.1        3                  700      0.927  0.809  0.875
  0.1        3                  800      0.925  0.814  0.877
  0.1        3                  900      0.924  0.802  0.879
  0.1        3                  1000     0.923  0.804  0.878
  0.1        3                  1100     0.923  0.804  0.876
  0.1        3                  1200     0.923  0.8    0.873
  0.1        3                  1300     0.921  0.796  0.876
  0.1        3                  1400     0.922  0.793  0.877
  0.1        3                  1500     0.921  0.793  0.878
  0.1        3                  1600     0.921  0.791  0.877
  0.1        3                  1700     0.922  0.784  0.878
  0.1        3                  1800     0.92   0.775  0.883
  0.1        3                  1900     0.921  0.784  0.881
  0.1        3                  2000     0.918  0.786  0.881
  0.1        5                  100      0.934  0.86   0.868
  0.1        5                  200      0.935  0.846  0.87 
  0.1        5                  300      0.933  0.833  0.872
  0.1        5                  400      0.932  0.828  0.875
  0.1        5                  500      0.931  0.816  0.875
  0.1        5                  600      0.93   0.832  0.877
  0.1        5                  700      0.929  0.818  0.879
  0.1        5                  800      0.926  0.8    0.882
  0.1        5                  900      0.927  0.802  0.883
  0.1        5                  1000     0.926  0.796  0.878
  0.1        5                  1100     0.926  0.807  0.881
  0.1        5                  1200     0.925  0.807  0.875
  0.1        5                  1300     0.925  0.805  0.877
  0.1        5                  1400     0.924  0.796  0.875
  0.1        5                  1500     0.924  0.809  0.877
  0.1        5                  1600     0.924  0.807  0.878
  0.1        5                  1700     0.923  0.811  0.878
  0.1        5                  1800     0.923  0.811  0.878
  0.1        5                  1900     0.921  0.809  0.876
  0.1        5                  2000     0.922  0.809  0.871
  0.1        7                  100      0.934  0.84   0.875
  0.1        7                  200      0.931  0.809  0.875
  0.1        7                  300      0.93   0.796  0.879
  0.1        7                  400      0.928  0.793  0.877
  0.1        7                  500      0.926  0.804  0.873
  0.1        7                  600      0.924  0.784  0.872
  0.1        7                  700      0.922  0.782  0.877
  0.1        7                  800      0.923  0.789  0.873
  0.1        7                  900      0.924  0.796  0.873
  0.1        7                  1000     0.924  0.793  0.875
  0.1        7                  1100     0.924  0.793  0.872
  0.1        7                  1200     0.923  0.791  0.876
  0.1        7                  1300     0.925  0.782  0.877
  0.1        7                  1400     0.923  0.775  0.878
  0.1        7                  1500     0.923  0.767  0.877
  0.1        7                  1600     0.923  0.767  0.877
  0.1        7                  1700     0.922  0.772  0.878
  0.1        7                  1800     0.922  0.779  0.879
  0.1        7                  1900     0.922  0.768  0.878
  0.1        7                  2000     0.921  0.77   0.878
  0.1        9                  100      0.933  0.828  0.871
  0.1        9                  200      0.931  0.814  0.889
  0.1        9                  300      0.929  0.796  0.887
  0.1        9                  400      0.928  0.793  0.881
  0.1        9                  500      0.926  0.789  0.884
  0.1        9                  600      0.927  0.779  0.883
  0.1        9                  700      0.928  0.791  0.883
  0.1        9                  800      0.928  0.791  0.884
  0.1        9                  900      0.926  0.777  0.881
  0.1        9                  1000     0.925  0.772  0.886
  0.1        9                  1100     0.925  0.777  0.887
  0.1        9                  1200     0.925  0.772  0.887
  0.1        9                  1300     0.925  0.763  0.883
  0.1        9                  1400     0.924  0.772  0.883
  0.1        9                  1500     0.922  0.763  0.88 
  0.1        9                  1600     0.922  0.761  0.884
  0.1        9                  1700     0.922  0.76   0.883
  0.1        9                  1800     0.922  0.758  0.882
  0.1        9                  1900     0.923  0.76   0.884
  0.1        9                  2000     0.923  0.765  0.886

ROC was used to select the optimal model using the largest value.
The final values used for the model were n.trees = 1300, interaction.depth =
 9 and shrinkage = 0.01. 
> 
> gbmFit$pred <- merge(gbmFit$pred,  gbmFit$bestTune)
> gbmCM <- confusionMatrix(gbmFit, norm = "none")
> gbmCM
Repeated Train/Test Splits Estimated (1 reps, 0.75%) Confusion Matrix 

(entries are un-normalized counts)
 
Confusion Matrix and Statistics

              Reference
Prediction     successful unsuccessful
  successful          480          123
  unsuccessful         90          864
                                          
               Accuracy : 0.8632          
                 95% CI : (0.8451, 0.8799)
    No Information Rate : 0.6339          
    P-Value [Acc > NIR] : < 2e-16         
                                          
                  Kappa : 0.7088          
 Mcnemar's Test P-Value : 0.02834         
                                          
            Sensitivity : 0.8421          
            Specificity : 0.8754          
         Pos Pred Value : 0.7960          
         Neg Pred Value : 0.9057          
             Prevalence : 0.3661          
         Detection Rate : 0.3083          
   Detection Prevalence : 0.3873          
      Balanced Accuracy : 0.8587          
                                          
       'Positive' Class : successful      
                                          

> 
> gbmRoc <- roc(response = gbmFit$pred$obs,
+               predictor = gbmFit$pred$successful,
+               levels = rev(levels(gbmFit$pred$obs)))
> 
> set.seed(476)
> gbmFactorFit <- train(x = training[,factorPredictors], 
+                       y = training$Class,
+                       method = "gbm",
+                       tuneGrid = gbmGrid,
+                       verbose = FALSE,
+                       metric = "ROC",
+                       trControl = ctrl)
> gbmFactorFit
Stochastic Gradient Boosting 

8190 samples
1488 predictors
   2 classes: 'successful', 'unsuccessful' 

No pre-processing
Resampling: Repeated Train/Test Splits Estimated (1 reps, 0.75%) 

Summary of sample sizes: 6633 

Resampling results across tuning parameters:

  shrinkage  interaction.depth  n.trees  ROC    Sens   Spec 
  0.01       1                  100      0.881  0.658  0.797
  0.01       1                  200      0.886  0.872  0.821
  0.01       1                  300      0.887  0.882  0.824
  0.01       1                  400      0.888  0.886  0.8  
  0.01       1                  500      0.886  0.886  0.8  
  0.01       1                  600      0.883  0.888  0.799
  0.01       1                  700      0.883  0.888  0.799
  0.01       1                  800      0.881  0.888  0.799
  0.01       1                  900      0.883  0.884  0.8  
  0.01       1                  1000     0.884  0.884  0.8  
  0.01       1                  1100     0.885  0.884  0.801
  0.01       1                  1200     0.883  0.882  0.802
  0.01       1                  1300     0.88   0.882  0.8  
  0.01       1                  1400     0.877  0.882  0.801
  0.01       1                  1500     0.873  0.884  0.8  
  0.01       1                  1600     0.87   0.882  0.8  
  0.01       1                  1700     0.869  0.881  0.802
  0.01       1                  1800     0.867  0.884  0.804
  0.01       1                  1900     0.866  0.884  0.803
  0.01       1                  2000     0.864  0.884  0.803
  0.01       3                  100      0.907  0.884  0.792
  0.01       3                  200      0.909  0.886  0.793
  0.01       3                  300      0.905  0.886  0.795
  0.01       3                  400      0.902  0.884  0.799
  0.01       3                  500      0.894  0.884  0.796
  0.01       3                  600      0.888  0.884  0.797
  0.01       3                  700      0.881  0.884  0.8  
  0.01       3                  800      0.878  0.886  0.803
  0.01       3                  900      0.874  0.888  0.804
  0.01       3                  1000     0.873  0.886  0.802
  0.01       3                  1100     0.872  0.886  0.805
  0.01       3                  1200     0.872  0.884  0.806
  0.01       3                  1300     0.872  0.881  0.807
  0.01       3                  1400     0.872  0.882  0.806
  0.01       3                  1500     0.872  0.881  0.807
  0.01       3                  1600     0.872  0.882  0.809
  0.01       3                  1700     0.872  0.881  0.81 
  0.01       3                  1800     0.872  0.888  0.81 
  0.01       3                  1900     0.872  0.884  0.807
  0.01       3                  2000     0.873  0.881  0.81 
  0.01       5                  100      0.909  0.86   0.805
  0.01       5                  200      0.906  0.875  0.792
  0.01       5                  300      0.899  0.879  0.799
  0.01       5                  400      0.894  0.882  0.798
  0.01       5                  500      0.886  0.882  0.798
  0.01       5                  600      0.881  0.882  0.801
  0.01       5                  700      0.878  0.879  0.802
  0.01       5                  800      0.877  0.879  0.803
  0.01       5                  900      0.876  0.877  0.803
  0.01       5                  1000     0.876  0.879  0.806
  0.01       5                  1100     0.876  0.879  0.806
  0.01       5                  1200     0.876  0.881  0.809
  0.01       5                  1300     0.876  0.879  0.806
  0.01       5                  1400     0.876  0.882  0.806
  0.01       5                  1500     0.876  0.884  0.809
  0.01       5                  1600     0.876  0.881  0.806
  0.01       5                  1700     0.876  0.882  0.806
  0.01       5                  1800     0.876  0.882  0.809
  0.01       5                  1900     0.876  0.879  0.805
  0.01       5                  2000     0.876  0.882  0.804
  0.01       7                  100      0.917  0.882  0.78 
  0.01       7                  200      0.904  0.879  0.797
  0.01       7                  300      0.896  0.881  0.797
  0.01       7                  400      0.886  0.875  0.804
  0.01       7                  500      0.88   0.877  0.804
  0.01       7                  600      0.878  0.875  0.803
  0.01       7                  700      0.876  0.877  0.806
  0.01       7                  800      0.876  0.877  0.807
  0.01       7                  900      0.876  0.879  0.813
  0.01       7                  1000     0.876  0.879  0.811
  0.01       7                  1100     0.875  0.875  0.81 
  0.01       7                  1200     0.875  0.875  0.811
  0.01       7                  1300     0.875  0.874  0.811
  0.01       7                  1400     0.875  0.875  0.811
  0.01       7                  1500     0.875  0.875  0.811
  0.01       7                  1600     0.875  0.874  0.811
  0.01       7                  1700     0.875  0.875  0.807
  0.01       7                  1800     0.875  0.875  0.806
  0.01       7                  1900     0.875  0.875  0.807
  0.01       7                  2000     0.875  0.877  0.811
  0.01       9                  100      0.913  0.882  0.789
  0.01       9                  200      0.904  0.881  0.789
  0.01       9                  300      0.893  0.879  0.795
  0.01       9                  400      0.883  0.881  0.804
  0.01       9                  500      0.879  0.881  0.806
  0.01       9                  600      0.877  0.879  0.806
  0.01       9                  700      0.876  0.881  0.811
  0.01       9                  800      0.876  0.881  0.811
  0.01       9                  900      0.875  0.881  0.811
  0.01       9                  1000     0.875  0.875  0.814
  0.01       9                  1100     0.875  0.874  0.81 
  0.01       9                  1200     0.875  0.874  0.81 
  0.01       9                  1300     0.875  0.874  0.81 
  0.01       9                  1400     0.875  0.872  0.81 
  0.01       9                  1500     0.875  0.874  0.81 
  0.01       9                  1600     0.874  0.874  0.81 
  0.01       9                  1700     0.874  0.875  0.811
  0.01       9                  1800     0.874  0.875  0.81 
  0.01       9                  1900     0.874  0.879  0.809
  0.01       9                  2000     0.874  0.879  0.809
  0.1        1                  100      0.882  0.891  0.8  
  0.1        1                  200      0.865  0.888  0.801
  0.1        1                  300      0.857  0.891  0.798
  0.1        1                  400      0.858  0.882  0.802
  0.1        1                  500      0.858  0.884  0.801
  0.1        1                  600      0.859  0.888  0.801
  0.1        1                  700      0.858  0.884  0.804
  0.1        1                  800      0.857  0.886  0.799
  0.1        1                  900      0.857  0.884  0.797
  0.1        1                  1000     0.856  0.886  0.8  
  0.1        1                  1100     0.857  0.886  0.801
  0.1        1                  1200     0.856  0.889  0.801
  0.1        1                  1300     0.856  0.891  0.804
  0.1        1                  1400     0.855  0.886  0.801
  0.1        1                  1500     0.855  0.882  0.804
  0.1        1                  1600     0.855  0.884  0.807
  0.1        1                  1700     0.856  0.888  0.801
  0.1        1                  1800     0.855  0.882  0.811
  0.1        1                  1900     0.855  0.881  0.807
  0.1        1                  2000     0.855  0.888  0.811
  0.1        3                  100      0.875  0.886  0.799
  0.1        3                  200      0.873  0.882  0.813
  0.1        3                  300      0.872  0.891  0.81 
  0.1        3                  400      0.872  0.889  0.809
  0.1        3                  500      0.871  0.888  0.812
  0.1        3                  600      0.87   0.893  0.812
  0.1        3                  700      0.87   0.888  0.811
  0.1        3                  800      0.87   0.889  0.81 
  0.1        3                  900      0.869  0.881  0.813
  0.1        3                  1000     0.869  0.879  0.815
  0.1        3                  1100     0.869  0.879  0.814
  0.1        3                  1200     0.868  0.884  0.811
  0.1        3                  1300     0.868  0.872  0.812
  0.1        3                  1400     0.867  0.877  0.807
  0.1        3                  1500     0.865  0.874  0.811
  0.1        3                  1600     0.865  0.881  0.81 
  0.1        3                  1700     0.864  0.877  0.812
  0.1        3                  1800     0.865  0.879  0.812
  0.1        3                  1900     0.865  0.879  0.815
  0.1        3                  2000     0.864  0.87   0.817
  0.1        5                  100      0.873  0.879  0.807
  0.1        5                  200      0.872  0.891  0.8  
  0.1        5                  300      0.871  0.875  0.814
  0.1        5                  400      0.87   0.882  0.806
  0.1        5                  500      0.868  0.879  0.806
  0.1        5                  600      0.869  0.87   0.807
  0.1        5                  700      0.868  0.875  0.809
  0.1        5                  800      0.866  0.881  0.811
  0.1        5                  900      0.865  0.879  0.805
  0.1        5                  1000     0.865  0.879  0.806
  0.1        5                  1100     0.864  0.868  0.81 
  0.1        5                  1200     0.863  0.877  0.807
  0.1        5                  1300     0.863  0.879  0.806
  0.1        5                  1400     0.863  0.875  0.805
  0.1        5                  1500     0.862  0.879  0.802
  0.1        5                  1600     0.862  0.872  0.806
  0.1        5                  1700     0.862  0.879  0.809
  0.1        5                  1800     0.862  0.877  0.807
  0.1        5                  1900     0.862  0.875  0.809
  0.1        5                  2000     0.861  0.879  0.803
  0.1        7                  100      0.876  0.893  0.809
  0.1        7                  200      0.873  0.879  0.804
  0.1        7                  300      0.87   0.882  0.799
  0.1        7                  400      0.868  0.882  0.798
  0.1        7                  500      0.864  0.879  0.8  
  0.1        7                  600      0.863  0.879  0.804
  0.1        7                  700      0.863  0.87   0.802
  0.1        7                  800      0.863  0.872  0.802
  0.1        7                  900      0.863  0.874  0.801
  0.1        7                  1000     0.862  0.868  0.8  
  0.1        7                  1100     0.861  0.863  0.794
  0.1        7                  1200     0.862  0.861  0.793
  0.1        7                  1300     0.861  0.863  0.796
  0.1        7                  1400     0.86   0.861  0.797
  0.1        7                  1500     0.86   0.867  0.796
  0.1        7                  1600     0.859  0.861  0.799
  0.1        7                  1700     0.859  0.87   0.797
  0.1        7                  1800     0.86   0.863  0.801
  0.1        7                  1900     0.86   0.868  0.799
  0.1        7                  2000     0.859  0.858  0.796
  0.1        9                  100      0.872  0.874  0.811
  0.1        9                  200      0.868  0.872  0.801
  0.1        9                  300      0.866  0.872  0.806
  0.1        9                  400      0.865  0.868  0.8  
  0.1        9                  500      0.863  0.872  0.801
  0.1        9                  600      0.861  0.879  0.803
  0.1        9                  700      0.861  0.874  0.8  
  0.1        9                  800      0.861  0.87   0.801
  0.1        9                  900      0.861  0.874  0.796
  0.1        9                  1000     0.86   0.868  0.795
  0.1        9                  1100     0.86   0.874  0.798
  0.1        9                  1200     0.859  0.868  0.797
  0.1        9                  1300     0.859  0.868  0.796
  0.1        9                  1400     0.859  0.87   0.797
  0.1        9                  1500     0.86   0.874  0.796
  0.1        9                  1600     0.859  0.868  0.796
  0.1        9                  1700     0.858  0.874  0.796
  0.1        9                  1800     0.86   0.874  0.799
  0.1        9                  1900     0.859  0.877  0.796
  0.1        9                  2000     0.859  0.879  0.795

ROC was used to select the optimal model using the largest value.
The final values used for the model were n.trees = 100, interaction.depth =
 7 and shrinkage = 0.01. 
> 
> gbmFactorFit$pred <- merge(gbmFactorFit$pred,  gbmFactorFit$bestTune)
> gbmFactorCM <- confusionMatrix(gbmFactorFit, norm = "none")
> gbmFactorCM
Repeated Train/Test Splits Estimated (1 reps, 0.75%) Confusion Matrix 

(entries are un-normalized counts)
 
Confusion Matrix and Statistics

              Reference
Prediction     successful unsuccessful
  successful          503          217
  unsuccessful         67          770
                                          
               Accuracy : 0.8176          
                 95% CI : (0.7975, 0.8365)
    No Information Rate : 0.6339          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.6277          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.8825          
            Specificity : 0.7801          
         Pos Pred Value : 0.6986          
         Neg Pred Value : 0.9200          
             Prevalence : 0.3661          
         Detection Rate : 0.3231          
   Detection Prevalence : 0.4624          
      Balanced Accuracy : 0.8313          
                                          
       'Positive' Class : successful      
                                          

> 
> gbmFactorRoc <- roc(response = gbmFactorFit$pred$obs,
+                     predictor = gbmFactorFit$pred$successful,
+                     levels = rev(levels(gbmFactorFit$pred$obs)))
> 
> gbmROCRange <- extendrange(cbind(gbmFactorFit$results$ROC,gbmFit$results$ROC))
> 
> plot(gbmFactorFit, ylim = gbmROCRange, 
+      auto.key = list(columns = 4, lines = TRUE))
> 
> 
> plot(gbmFit, ylim = gbmROCRange, 
+      auto.key = list(columns = 4, lines = TRUE))
> 
> 
> plot(treebagRoc, type = "s", col = rgb(.2, .2, .2, .2), legacy.axes = TRUE)

Call:
roc.default(response = treebagFit$pred$obs, predictor = treebagFit$pred$successful,     levels = rev(levels(treebagFit$pred$obs)))

Data: treebagFit$pred$successful in 987 controls (treebagFit$pred$obs unsuccessful) < 570 cases (treebagFit$pred$obs successful).
Area under the curve: 0.9205
> plot(rpartRoc, type = "s", add = TRUE, col = rgb(.2, .2, .2, .2), legacy.axes = TRUE)

Call:
roc.default(response = rpartFit$pred$obs, predictor = rpartFit$pred$successful,     levels = rev(levels(rpartFit$pred$obs)))

Data: rpartFit$pred$successful in 29610 controls (rpartFit$pred$obs unsuccessful) < 17100 cases (rpartFit$pred$obs successful).
Area under the curve: 0.8915
> plot(j48FactorRoc, type = "s", add = TRUE, col = rgb(.2, .2, .2, .2), legacy.axes = TRUE)

Call:
roc.default(response = j48FactorFit$pred$obs, predictor = j48FactorFit$pred$successful,     levels = rev(levels(j48FactorFit$pred$obs)))

Data: j48FactorFit$pred$successful in 987 controls (j48FactorFit$pred$obs unsuccessful) < 570 cases (j48FactorFit$pred$obs successful).
Area under the curve: 0.8353
> plot(rfFactorRoc, type = "s", add = TRUE, col = rgb(.2, .2, .2, .2), legacy.axes = TRUE)

Call:
roc.default(response = rfFactorFit$pred$obs, predictor = rfFactorFit$pred$successful,     levels = rev(levels(rfFactorFit$pred$obs)))

Data: rfFactorFit$pred$successful in 8883 controls (rfFactorFit$pred$obs unsuccessful) < 5130 cases (rfFactorFit$pred$obs successful).
Area under the curve: 0.9049
> plot(gbmRoc, type = "s", print.thres = c(.5), print.thres.pch = 3, 
+      print.thres.pattern = "", print.thres.cex = 1.2,
+      add = TRUE, col = "red", print.thres.col = "red", legacy.axes = TRUE)

Call:
roc.default(response = gbmFit$pred$obs, predictor = gbmFit$pred$successful,     levels = rev(levels(gbmFit$pred$obs)))

Data: gbmFit$pred$successful in 987 controls (gbmFit$pred$obs unsuccessful) < 570 cases (gbmFit$pred$obs successful).
Area under the curve: 0.9361
> plot(gbmFactorRoc, type = "s", print.thres = c(.5), print.thres.pch = 16, 
+      legacy.axes = TRUE, print.thres.pattern = "", print.thres.cex = 1.2,
+      add = TRUE)

Call:
roc.default(response = gbmFactorFit$pred$obs, predictor = gbmFactorFit$pred$successful,     levels = rev(levels(gbmFactorFit$pred$obs)))

Data: gbmFactorFit$pred$successful in 987 controls (gbmFactorFit$pred$obs unsuccessful) < 570 cases (gbmFactorFit$pred$obs successful).
Area under the curve: 0.9168
> legend(.75, .2,
+        c("Grouped Categories", "Independent Categories"),
+        lwd = c(1, 1),
+        col = c("black", "red"),
+        pch = c(16, 3))
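
## The overlay idiom above (pROC's plot() with add = TRUE, then legend())
## can be reproduced standalone. This is a minimal sketch with simulated
## class scores -- the Grant Applications fits are not rebuilt here, so
## obs, score1, and score2 below are placeholder data, not the book's.
##
## library(pROC)
## set.seed(1)
## obs    <- factor(rep(c("successful", "unsuccessful"), times = c(570, 987)),
##                  levels = c("successful", "unsuccessful"))
## score1 <- ifelse(obs == "successful", rnorm(1557, 1),   rnorm(1557, 0))
## score2 <- ifelse(obs == "successful", rnorm(1557, 0.8), rnorm(1557, 0))
## roc1 <- roc(response = obs, predictor = score1, levels = rev(levels(obs)))
## roc2 <- roc(response = obs, predictor = score2, levels = rev(levels(obs)))
## plot(roc1, type = "s", col = "red", legacy.axes = TRUE)  # first curve sets axes
## plot(roc2, type = "s", add = TRUE)                       # overlay on same axes
## legend("bottomright", c("model 1", "model 2"),
##        col = c("red", "black"), lwd = 1)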
> 
> ################################################################################
> ### Section 14.5 C5.0
> 
> c50Grid <- expand.grid(trials = c(1:9, (1:10)*10),
+                        model = c("tree", "rules"),
+                        winnow = c(TRUE, FALSE))
> set.seed(476)
> c50FactorFit <- train(training[,factorPredictors], training$Class,
+                       method = "C5.0",
+                       tuneGrid = c50Grid,
+                       verbose = FALSE,
+                       metric = "ROC",
+                       trControl = ctrl)
Loading required package: C50
> c50FactorFit
C5.0 

8190 samples
1488 predictors
   2 classes: 'successful', 'unsuccessful' 

No pre-processing
Resampling: Repeated Train/Test Splits Estimated (1 reps, 0.75%) 

Summary of sample sizes: 6633 

Resampling results across tuning parameters:

  model  winnow  trials  ROC    Sens   Spec 
  rules  FALSE   1       0.877  0.886  0.796
  rules  FALSE   2       0.886  0.621  0.919
  rules  FALSE   3       0.9    0.782  0.844
  rules  FALSE   4       0.905  0.816  0.858
  rules  FALSE   5       0.907  0.802  0.846
  rules  FALSE   6       0.917  0.832  0.841
  rules  FALSE   7       0.922  0.796  0.873
  rules  FALSE   8       0.924  0.847  0.866
  rules  FALSE   9       0.923  0.832  0.867
  rules  FALSE   10      0.92   0.818  0.87 
  rules  FALSE   20      0.934  0.823  0.888
  rules  FALSE   30      0.937  0.844  0.875
  rules  FALSE   40      0.938  0.844  0.88 
  rules  FALSE   50      0.939  0.835  0.88 
  rules  FALSE   60      0.94   0.842  0.882
  rules  FALSE   70      0.939  0.839  0.884
  rules  FALSE   80      0.941  0.847  0.886
  rules  FALSE   90      0.941  0.842  0.884
  rules  FALSE   100     0.942  0.849  0.888
  rules  TRUE    1       0.859  0.886  0.81 
  rules  TRUE    2       0.892  0.784  0.851
  rules  TRUE    3       0.895  0.796  0.85 
  rules  TRUE    4       0.914  0.811  0.862
  rules  TRUE    5       0.919  0.828  0.865
  rules  TRUE    6       0.923  0.795  0.875
  rules  TRUE    7       0.927  0.856  0.854
  rules  TRUE    8       0.93   0.818  0.876
  rules  TRUE    9       0.931  0.846  0.867
  rules  TRUE    10      0.932  0.854  0.869
  rules  TRUE    20      0.932  0.854  0.869
  rules  TRUE    30      0.933  0.849  0.869
  rules  TRUE    40      0.935  0.856  0.871
  rules  TRUE    50      0.936  0.856  0.87 
  rules  TRUE    60      0.936  0.856  0.868
  rules  TRUE    70      0.936  0.868  0.867
  rules  TRUE    80      0.937  0.858  0.873
  rules  TRUE    90      0.937  0.867  0.869
  rules  TRUE    100     0.937  0.87   0.874
  tree   FALSE   1       0.906  0.874  0.832
  tree   FALSE   2       0.903  0.886  0.838
  tree   FALSE   3       0.908  0.809  0.853
  tree   FALSE   4       0.908  0.84   0.859
  tree   FALSE   5       0.909  0.818  0.835
  tree   FALSE   6       0.908  0.835  0.844
  tree   FALSE   7       0.909  0.825  0.835
  tree   FALSE   8       0.913  0.842  0.844
  tree   FALSE   9       0.921  0.847  0.839
  tree   FALSE   10      0.921  0.847  0.838
  tree   FALSE   20      0.929  0.853  0.855
  tree   FALSE   30      0.933  0.858  0.868
  tree   FALSE   40      0.934  0.853  0.875
  tree   FALSE   50      0.934  0.847  0.872
  tree   FALSE   60      0.935  0.86   0.872
  tree   FALSE   70      0.935  0.854  0.872
  tree   FALSE   80      0.935  0.856  0.867
  tree   FALSE   90      0.935  0.853  0.866
  tree   FALSE   100     0.936  0.847  0.867
  tree   TRUE    1       0.904  0.877  0.826
  tree   TRUE    2       0.895  0.874  0.85 
  tree   TRUE    3       0.91   0.856  0.835
  tree   TRUE    4       0.911  0.826  0.83 
  tree   TRUE    5       0.912  0.816  0.848
  tree   TRUE    6       0.918  0.856  0.852
  tree   TRUE    7       0.919  0.833  0.856
  tree   TRUE    8       0.92   0.837  0.854
  tree   TRUE    9       0.921  0.83   0.854
  tree   TRUE    10      0.923  0.833  0.846
  tree   TRUE    20      0.929  0.856  0.863
  tree   TRUE    30      0.932  0.867  0.86 
  tree   TRUE    40      0.933  0.865  0.867
  tree   TRUE    50      0.934  0.868  0.873
  tree   TRUE    60      0.935  0.865  0.869
  tree   TRUE    70      0.934  0.877  0.854
  tree   TRUE    80      0.935  0.865  0.86 
  tree   TRUE    90      0.934  0.861  0.869
  tree   TRUE    100     0.935  0.872  0.866

ROC was used to select the optimal model using  the largest value.
The final values used for the model were trials = 100, model = rules and
 winnow = FALSE. 
> 
> c50FactorFit$pred <- merge(c50FactorFit$pred,  c50FactorFit$bestTune)
> c50FactorCM <- confusionMatrix(c50FactorFit, norm = "none")
> c50FactorCM
Repeated Train/Test Splits Estimated (1 reps, 0.75%) Confusion Matrix 

(entries are un-normalized counts)
 
Confusion Matrix and Statistics

              Reference
Prediction     successful unsuccessful
  successful          484          111
  unsuccessful         86          876
                                          
               Accuracy : 0.8735          
                 95% CI : (0.8559, 0.8896)
    No Information Rate : 0.6339          
    P-Value [Acc > NIR] : < 2e-16         
                                          
                  Kappa : 0.7299          
 Mcnemar's Test P-Value : 0.08728         
                                          
            Sensitivity : 0.8491          
            Specificity : 0.8875          
         Pos Pred Value : 0.8134          
         Neg Pred Value : 0.9106          
             Prevalence : 0.3661          
         Detection Rate : 0.3109          
   Detection Prevalence : 0.3821          
      Balanced Accuracy : 0.8683          
                                          
       'Positive' Class : successful      
                                          

> 
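The statistics in the confusion matrix above follow mechanically from the four cell counts. As a cross-check, here is a standalone Python sketch (in the notebook's host language, not part of the book's R scripts) that recomputes the reported accuracy, sensitivity, specificity, and Cohen's kappa from the printed counts for the grouped-category C5.0 fit:

```python
# Recompute caret's confusionMatrix statistics from the printed counts
# ('successful' is the positive class).
tp, fn = 484, 86   # reference 'successful' predicted successful / unsuccessful
fp, tn = 111, 876  # reference 'unsuccessful' predicted successful / unsuccessful

total = tp + fp + fn + tn                 # 1557 hold-out samples
accuracy = (tp + tn) / total              # observed agreement
sensitivity = tp / (tp + fn)              # recall on 'successful'
specificity = tn / (tn + fp)

# Expected agreement under chance, from the row/column marginals:
p_chance = (((tp + fp) / total) * ((tp + fn) / total) +
            ((fn + tn) / total) * ((fp + tn) / total))
kappa = (accuracy - p_chance) / (1 - p_chance)

print(round(accuracy, 4), round(sensitivity, 4),
      round(specificity, 4), round(kappa, 4))
# matches the printed 0.8735, 0.8491, 0.8875, 0.7299
```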
> c50FactorRoc <- roc(response = c50FactorFit$pred$obs,
+                     predictor = c50FactorFit$pred$successful,
+                     levels = rev(levels(c50FactorFit$pred$obs)))
> 
> set.seed(476)
> c50Fit <- train(training[,fullSet], training$Class,
+                 method = "C5.0",
+                 tuneGrid = c50Grid,
+                 metric = "ROC",
+                 verbose = FALSE,
+                 trControl = ctrl)
> c50Fit
C5.0 

8190 samples
1070 predictors
   2 classes: 'successful', 'unsuccessful' 

No pre-processing
Resampling: Repeated Train/Test Splits Estimated (1 reps, 0.75%) 

Summary of sample sizes: 6633 

Resampling results across tuning parameters:

  model  winnow  trials  ROC    Sens   Spec 
  rules  FALSE   1       0.893  0.768  0.87 
  rules  FALSE   2       0.877  0.872  0.831
  rules  FALSE   3       0.896  0.747  0.874
  rules  FALSE   4       0.901  0.823  0.858
  rules  FALSE   5       0.901  0.753  0.883
  rules  FALSE   6       0.914  0.851  0.855
  rules  FALSE   7       0.919  0.805  0.87 
  rules  FALSE   8       0.919  0.839  0.859
  rules  FALSE   9       0.924  0.833  0.872
  rules  FALSE   10      0.921  0.839  0.867
  rules  FALSE   20      0.928  0.846  0.866
  rules  FALSE   30      0.932  0.842  0.868
  rules  FALSE   40      0.934  0.84   0.872
  rules  FALSE   50      0.931  0.826  0.872
  rules  FALSE   60      0.933  0.842  0.872
  rules  FALSE   70      0.934  0.839  0.869
  rules  FALSE   80      0.935  0.84   0.873
  rules  FALSE   90      0.935  0.832  0.872
  rules  FALSE   100     0.935  0.844  0.871
  rules  TRUE    1       0.85   0.847  0.847
  rules  TRUE    2       0.882  0.868  0.829
  rules  TRUE    3       0.899  0.775  0.868
  rules  TRUE    4       0.91   0.854  0.834
  rules  TRUE    5       0.918  0.821  0.854
  rules  TRUE    6       0.915  0.839  0.839
  rules  TRUE    7       0.917  0.786  0.867
  rules  TRUE    8       0.921  0.842  0.853
  rules  TRUE    9       0.917  0.814  0.865
  rules  TRUE    10      0.919  0.825  0.862
  rules  TRUE    20      0.927  0.84   0.858
  rules  TRUE    30      0.923  0.809  0.869
  rules  TRUE    40      0.927  0.84   0.866
  rules  TRUE    50      0.927  0.844  0.862
  rules  TRUE    60      0.928  0.839  0.867
  rules  TRUE    70      0.928  0.837  0.866
  rules  TRUE    80      0.929  0.833  0.864
  rules  TRUE    90      0.93   0.823  0.873
  rules  TRUE    100     0.931  0.825  0.872
  tree   FALSE   1       0.9    0.753  0.878
  tree   FALSE   2       0.874  0.805  0.858
  tree   FALSE   3       0.908  0.758  0.872
  tree   FALSE   4       0.914  0.832  0.852
  tree   FALSE   5       0.921  0.814  0.857
  tree   FALSE   6       0.916  0.826  0.851
  tree   FALSE   7       0.921  0.805  0.869
  tree   FALSE   8       0.923  0.835  0.852
  tree   FALSE   9       0.924  0.809  0.866
  tree   FALSE   10      0.924  0.825  0.864
  tree   FALSE   20      0.932  0.823  0.873
  tree   FALSE   30      0.932  0.819  0.88 
  tree   FALSE   40      0.932  0.828  0.881
  tree   FALSE   50      0.932  0.83   0.878
  tree   FALSE   60      0.933  0.842  0.874
  tree   FALSE   70      0.934  0.842  0.87 
  tree   FALSE   80      0.934  0.835  0.868
  tree   FALSE   90      0.934  0.837  0.872
  tree   FALSE   100     0.935  0.842  0.875
  tree   TRUE    1       0.905  0.837  0.854
  tree   TRUE    2       0.877  0.782  0.851
  tree   TRUE    3       0.896  0.753  0.864
  tree   TRUE    4       0.902  0.774  0.862
  tree   TRUE    5       0.908  0.791  0.852
  tree   TRUE    6       0.908  0.805  0.856
  tree   TRUE    7       0.914  0.798  0.868
  tree   TRUE    8       0.915  0.795  0.865
  tree   TRUE    9       0.916  0.782  0.867
  tree   TRUE    10      0.919  0.809  0.864
  tree   TRUE    20      0.919  0.807  0.874
  tree   TRUE    30      0.926  0.804  0.873
  tree   TRUE    40      0.927  0.809  0.877
  tree   TRUE    50      0.928  0.814  0.873
  tree   TRUE    60      0.926  0.809  0.872
  tree   TRUE    70      0.928  0.812  0.871
  tree   TRUE    80      0.929  0.816  0.869
  tree   TRUE    90      0.929  0.816  0.872
  tree   TRUE    100     0.929  0.818  0.869

ROC was used to select the optimal model using  the largest value.
The final values used for the model were trials = 90, model = rules and
 winnow = FALSE. 
> 
> c50Fit$pred <- merge(c50Fit$pred,  c50Fit$bestTune)
> c50CM <- confusionMatrix(c50Fit, norm = "none")
> c50CM
Repeated Train/Test Splits Estimated (1 reps, 0.75%) Confusion Matrix 

(entries are un-normalized counts)
 
Confusion Matrix and Statistics

              Reference
Prediction     successful unsuccessful
  successful          474          126
  unsuccessful         96          861
                                          
               Accuracy : 0.8574          
                 95% CI : (0.8391, 0.8744)
    No Information Rate : 0.6339          
    P-Value [Acc > NIR] : < 2e-16         
                                          
                  Kappa : 0.6962          
 Mcnemar's Test P-Value : 0.05161         
                                          
            Sensitivity : 0.8316          
            Specificity : 0.8723          
         Pos Pred Value : 0.7900          
         Neg Pred Value : 0.8997          
             Prevalence : 0.3661          
         Detection Rate : 0.3044          
   Detection Prevalence : 0.3854          
      Balanced Accuracy : 0.8520          
                                          
       'Positive' Class : successful      
                                          

> 
> c50Roc <- roc(response = c50Fit$pred$obs,
+               predictor = c50Fit$pred$successful,
+               levels = rev(levels(c50Fit$pred$obs)))
> 
> update(plot(c50FactorFit), ylab = "ROC AUC (2008 Hold-Out Data)")
> 
> 
> plot(treebagRoc, type = "s", col = rgb(.2, .2, .2, .2), legacy.axes = TRUE)

Call:
roc.default(response = treebagFit$pred$obs, predictor = treebagFit$pred$successful,     levels = rev(levels(treebagFit$pred$obs)))

Data: treebagFit$pred$successful in 987 controls (treebagFit$pred$obs unsuccessful) < 570 cases (treebagFit$pred$obs successful).
Area under the curve: 0.9205
> plot(rpartRoc, type = "s", add = TRUE, col = rgb(.2, .2, .2, .2), legacy.axes = TRUE)

Call:
roc.default(response = rpartFit$pred$obs, predictor = rpartFit$pred$successful,     levels = rev(levels(rpartFit$pred$obs)))

Data: rpartFit$pred$successful in 29610 controls (rpartFit$pred$obs unsuccessful) < 17100 cases (rpartFit$pred$obs successful).
Area under the curve: 0.8915
> plot(j48FactorRoc, type = "s", add = TRUE, col = rgb(.2, .2, .2, .2), legacy.axes = TRUE)

Call:
roc.default(response = j48FactorFit$pred$obs, predictor = j48FactorFit$pred$successful,     levels = rev(levels(j48FactorFit$pred$obs)))

Data: j48FactorFit$pred$successful in 987 controls (j48FactorFit$pred$obs unsuccessful) < 570 cases (j48FactorFit$pred$obs successful).
Area under the curve: 0.8353
> plot(rfFactorRoc, type = "s", add = TRUE, col = rgb(.2, .2, .2, .2), legacy.axes = TRUE)

Call:
roc.default(response = rfFactorFit$pred$obs, predictor = rfFactorFit$pred$successful,     levels = rev(levels(rfFactorFit$pred$obs)))

Data: rfFactorFit$pred$successful in 8883 controls (rfFactorFit$pred$obs unsuccessful) < 5130 cases (rfFactorFit$pred$obs successful).
Area under the curve: 0.9049
> plot(gbmRoc, type = "s",  col = rgb(.2, .2, .2, .2), add = TRUE, legacy.axes = TRUE)

Call:
roc.default(response = gbmFit$pred$obs, predictor = gbmFit$pred$successful,     levels = rev(levels(gbmFit$pred$obs)))

Data: gbmFit$pred$successful in 987 controls (gbmFit$pred$obs unsuccessful) < 570 cases (gbmFit$pred$obs successful).
Area under the curve: 0.9361
> plot(c50Roc, type = "s", print.thres = c(.5), print.thres.pch = 3, 
+      print.thres.pattern = "", print.thres.cex = 1.2,
+      add = TRUE, col = "red", print.thres.col = "red", legacy.axes = TRUE)

Call:
roc.default(response = c50Fit$pred$obs, predictor = c50Fit$pred$successful,     levels = rev(levels(c50Fit$pred$obs)))

Data: c50Fit$pred$successful in 987 controls (c50Fit$pred$obs unsuccessful) < 570 cases (c50Fit$pred$obs successful).
Area under the curve: 0.9352
> plot(c50FactorRoc, type = "s", print.thres = c(.5), print.thres.pch = 16, 
+      print.thres.pattern = "", print.thres.cex = 1.2,
+      add = TRUE, legacy.axes = TRUE)

Call:
roc.default(response = c50FactorFit$pred$obs, predictor = c50FactorFit$pred$successful,     levels = rev(levels(c50FactorFit$pred$obs)))

Data: c50FactorFit$pred$successful in 987 controls (c50FactorFit$pred$obs unsuccessful) < 570 cases (c50FactorFit$pred$obs successful).
Area under the curve: 0.942
> legend(.75, .2,
+        c("Grouped Categories", "Independent Categories"),
+        lwd = c(1, 1),
+        col = c("black", "red"),
+        pch = c(16, 3))
> 
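Each of the overlaid ROC curves above is summarized by a single AUC. As a reminder of what that number means, the AUC equals the probability that a randomly chosen positive sample ('successful' here) is scored above a randomly chosen negative one, with ties counted as half. A minimal standalone Python sketch (not part of the book's R scripts) of that pairwise reading:

```python
# AUC as the Mann-Whitney pairwise-comparison statistic: the fraction of
# (positive, negative) score pairs ranked correctly, counting ties as 0.5.
def auc(pos_scores, neg_scores):
    wins = sum((p > n) + 0.5 * (p == n)
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

# 5 of the 6 pairs are ordered correctly:
assert auc([0.9, 0.8, 0.4], [0.7, 0.3]) == 5.0 / 6.0
```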
> ################################################################################
> ### Section 14.7 Comparing Two Encodings of Categorical Predictors
> 
> ## Pull the hold-out results from each model and merge
> 
> rp1 <- caret:::getTrainPerf(rpartFit)
> names(rp1) <- gsub("Train", "Independent", names(rp1))
> rp2 <- caret:::getTrainPerf(rpartFactorFit)
> rp2$Label <- "CART"
> names(rp2) <- gsub("Train", "Grouped", names(rp2))
> rp <- cbind(rp1, rp2)
> 
> j481 <- caret:::getTrainPerf(j48Fit)
> names(j481) <- gsub("Train", "Independent", names(j481))
> j482 <- caret:::getTrainPerf(j48FactorFit)
> j482$Label <- "J48"
> names(j482) <- gsub("Train", "Grouped", names(j482))
> j48 <- cbind(j481, j482)
> 
> part1 <- caret:::getTrainPerf(partFit)
> names(part1) <- gsub("Train", "Independent", names(part1))
> part2 <- caret:::getTrainPerf(partFactorFit)
> part2$Label <- "PART"
> names(part2) <- gsub("Train", "Grouped", names(part2))
> part <- cbind(part1, part2)
> 
> tb1 <- caret:::getTrainPerf(treebagFit)
> names(tb1) <- gsub("Train", "Independent", names(tb1))
> tb2 <- caret:::getTrainPerf(treebagFactorFit)
> tb2$Label <- "Bagged Tree"
> names(tb2) <- gsub("Train", "Grouped", names(tb2))
> tb <- cbind(tb1, tb2)
> 
> rf1 <- caret:::getTrainPerf(rfFit)
> names(rf1) <- gsub("Train", "Independent", names(rf1))
> rf2 <- caret:::getTrainPerf(rfFactorFit)
> rf2$Label <- "Random Forest"
> names(rf2) <- gsub("Train", "Grouped", names(rf2))
> rf <- cbind(rf1, rf2)
> 
> gbm1 <- caret:::getTrainPerf(gbmFit)
> names(gbm1) <- gsub("Train", "Independent", names(gbm1))
> gbm2 <- caret:::getTrainPerf(gbmFactorFit)
> gbm2$Label <- "Boosted Tree"
> names(gbm2) <- gsub("Train", "Grouped", names(gbm2))
> bst <- cbind(gbm1, gbm2)
> 
> 
> c501 <- caret:::getTrainPerf(c50Fit)
> names(c501) <- gsub("Train", "Independent", names(c501))
> c502 <- caret:::getTrainPerf(c50FactorFit)
> c502$Label <- "C5.0"
> names(c502) <- gsub("Train", "Grouped", names(c502))
> c5 <- cbind(c501, c502)
> 
> 
> trainPerf <- rbind(rp, j48, part, tb, rf, bst, c5)
> 
> library(lattice)
> library(reshape2)
> trainPerf <- melt(trainPerf)
Using method, method, Label as id variables
> trainPerf$metric <- "ROC"
> trainPerf$metric[grepl("Sens", trainPerf$variable)] <- "Sensitivity"
> trainPerf$metric[grepl("Spec", trainPerf$variable)] <- "Specificity"
> trainPerf$model <- "Grouped"
> trainPerf$model[grepl("Independent", trainPerf$variable)] <- "Independent"
> 
> trainPerf <- melt(trainPerf)
Using method, method.1, Label, variable, metric, model as id variables
> trainPerf$metric <- "ROC"
> trainPerf$metric[grepl("Sens", trainPerf$variable)] <- "Sensitivity"
> trainPerf$metric[grepl("Spec", trainPerf$variable)] <- "Specificity"
> trainPerf$model <- "Independent"
> trainPerf$model[grepl("Grouped", trainPerf$variable)] <- "Grouped"
> trainPerf$Label <- factor(trainPerf$Label,
+                           levels = rev(c("CART", "Cond. Trees", "J48", "Ripper",
+                                          "PART", "Bagged Tree", "Random Forest", 
+                                          "Boosted Tree", "C5.0")))
> 
> dotplot(Label ~ value|metric,
+         data = trainPerf,
+         groups = model,
+         horizontal = TRUE,
+         auto.key = list(columns = 2),
+         between = list(x = 1),
+         xlab = "")
> 
> 
> sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] C

attached base packages:
 [1] parallel  splines   grid      stats     graphics  grDevices utils    
 [8] datasets  methods   base     

other attached packages:
 [1] reshape2_1.2.2     C50_0.1.0-15       gbm_2.1            randomForest_4.6-7
 [5] ipred_0.9-1        prodlim_1.3.7      nnet_7.3-6         survival_2.37-4   
 [9] MASS_7.3-26        RWeka_0.4-17       e1071_1.6-1        class_7.3-7       
[13] partykit_0.1-5     pROC_1.5.4         plyr_1.8           rpart_4.1-1       
[17] caret_6.0-22       ggplot2_0.9.3.1    lattice_0.20-15   

loaded via a namespace (and not attached):
 [1] KernSmooth_2.23-10 RColorBrewer_1.0-5 RWekajars_3.7.9-1  car_2.0-17        
 [5] codetools_0.2-8    colorspace_1.2-2   compiler_3.0.1     dichromat_2.0-0   
 [9] digest_0.6.3       foreach_1.4.0      gtable_0.1.2       iterators_1.0.6   
[13] labeling_0.1       munsell_0.4        proto_0.3-10       rJava_0.9-4       
[17] scales_0.2.3       stringr_0.6.2     
> 
> q("no")
> proc.time()
      user     system    elapsed 
208496.296    776.829 209791.456 
In [77]:
%%R -w 600 -h 600

## runChapterScript(14)

##        user     system    elapsed 
##  208496.296    776.829 209791.456
NULL
In [88]:
%%R

showChapterScript(16)
NULL
In [79]:
%%R

showChapterOutput(16)
R Information
R version 3.0.1 (2013-05-16) -- "Good Sport"
Copyright (C) 2013 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin10.8.0 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> ################################################################################
> ### R code from Applied Predictive Modeling (2013) by Kuhn and Johnson.
> ### Copyright 2013 Kuhn and Johnson
> ### Web Page: http://www.appliedpredictivemodeling.com
> ### Contact: Max Kuhn (mxkuhn@gmail.com) 
> ###
> ### Chapter 16: Remedies for Severe Class Imbalance
> ###
> ### Required packages: AppliedPredictiveModeling, caret, C50, earth, DMwR, 
> ###                    DWD, kernlab, mda, pROC, randomForest, rpart
> ###
> ### Data used: The insurance data from the DWD package. 
> ###
> ### Notes: 
> ### 1) This code is provided without warranty.
> ###
> ### 2) This code should help the user reproduce the results in the
> ### text. There will be differences between this code and what is in
> ### the computing section. For example, the computing sections show
> ### how the source functions work (e.g. randomForest() or plsr()),
> ### which were not directly used when creating the book. Also, there may be 
> ### syntax differences that occur over time as packages evolve. These files 
> ### will reflect those changes.
> ###
> ### 3) In some cases, the calculations in the book were run in 
> ### parallel. The sub-processes may reset the random number seed.
> ### Your results may vary slightly.
> ###
> ################################################################################
> 
> ################################################################################
> ### Section 16.1 Case Study: Predicting Caravan Policy Ownership
> 
> library(DWD)
Loading required package: Matrix
Loading required package: lattice
> data(ticdata)
> 
> ### Some of the predictor names and levels have characters that would result in
> ### illegal variable names. We convert them to more generic names and treat the
> ### ordered factors as nominal (i.e. unordered) factors. 
> 
> isOrdered <- unlist(lapply(ticdata, function(x) any(class(x) == "ordered")))
> 
> recodeLevels <- function(x)
+   {
+     x <- gsub("f ", "", as.character(x))
+     x <- gsub(" - ", "_to_", x)
+     x <- gsub("-", "_to_", x)
+     x <- gsub("%", "", x)
+     x <- gsub("?", "Unk", x, fixed = TRUE)
+     x <- gsub("[,'\\(\\)]", "", x)
+     x <- gsub(" ", "_", x)
+     factor(paste("_", x, sep = ""))
+   }
> 
> convertCols <- c("STYPE", "MGEMLEEF", "MOSHOOFD",
+                  names(isOrdered)[isOrdered])
> 
> for(i in convertCols) ticdata[,i] <- factor(gsub(" ", "0",format(as.numeric(ticdata[,i]))))
> 
> ticdata$CARAVAN <- factor(as.character(ticdata$CARAVAN),
+                           levels = rev(levels(ticdata$CARAVAN)))
> 
> ### Split the data into three sets: training, test and evaluation. 
> library(caret)
Loading required package: ggplot2
> 
> set.seed(156)
> 
> split1 <- createDataPartition(ticdata$CARAVAN, p = .7)[[1]]
> 
> other     <- ticdata[-split1,]
> training  <- ticdata[ split1,]
> 
> set.seed(934)
> 
> split2 <- createDataPartition(other$CARAVAN, p = 1/3)[[1]]
> 
> evaluation  <- other[ split2,]
> testing     <- other[-split2,]
> 
> predictors <- names(training)[names(training) != "CARAVAN"]
> 
> testResults <- data.frame(CARAVAN = testing$CARAVAN)
> evalResults <- data.frame(CARAVAN = evaluation$CARAVAN)
> 
> trainingInd <- data.frame(model.matrix(CARAVAN ~ ., data = training))[,-1]
> evaluationInd <- data.frame(model.matrix(CARAVAN ~ ., data = evaluation))[,-1]
> testingInd <- data.frame(model.matrix(CARAVAN ~ ., data = testing))[,-1]
> 
> trainingInd$CARAVAN <- training$CARAVAN
> evaluationInd$CARAVAN <- evaluation$CARAVAN
> testingInd$CARAVAN <- testing$CARAVAN
> 
> isNZV <- nearZeroVar(trainingInd)
> noNZVSet <- names(trainingInd)[-isNZV]
> 
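The `nearZeroVar()` call above flags dummy variables that are almost constant: with its default cutoffs, a column is removed when the ratio of the most common value to the second most common exceeds 95/5 and fewer than 10% of the values are unique. A standalone Python sketch of that rule (an illustration of caret's documented defaults, not the package's actual implementation):

```python
# Sketch of caret's nearZeroVar() rule with its default cutoffs
# (freqCut = 95/5, uniqueCut = 10).
from collections import Counter

def near_zero_var(column, freq_cut=95 / 5, unique_cut=10.0):
    counts = Counter(column).most_common(2)
    if len(counts) == 1:                  # zero variance: a single value
        return True
    freq_ratio = counts[0][1] / counts[1][1]
    pct_unique = 100.0 * len(set(column)) / len(column)
    return freq_ratio > freq_cut and pct_unique < unique_cut

# A dummy column with one rare level (1 one among 99 zeros) is flagged:
assert near_zero_var([0] * 99 + [1]) is True
# A balanced binary predictor is kept:
assert near_zero_var([0, 1] * 50) is False
```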
> testResults <- data.frame(CARAVAN = testing$CARAVAN)
> evalResults <- data.frame(CARAVAN = evaluation$CARAVAN)
> 
> ################################################################################
> ### Section 16.2 The Effect of Class Imbalance
> 
> ### These functions are used to measure performance
> 
> fiveStats <- function(...) c(twoClassSummary(...), defaultSummary(...))
> fourStats <- function (data, lev = levels(data$obs), model = NULL)
+ {
+ 
+   accKapp <- postResample(data[, "pred"], data[, "obs"])
+   out <- c(accKapp,
+            sensitivity(data[, "pred"], data[, "obs"], lev[1]),
+            specificity(data[, "pred"], data[, "obs"], lev[2]))
+   names(out)[3:4] <- c("Sens", "Spec")
+   out
+ }
> 
> ctrl <- trainControl(method = "cv",
+                      classProbs = TRUE,
+                      summaryFunction = fiveStats)
> 
> ctrlNoProb <- ctrl
> ctrlNoProb$summaryFunction <- fourStats
> ctrlNoProb$classProbs <- FALSE
> 
> 
> set.seed(1410)
> rfFit <- train(CARAVAN ~ ., data = trainingInd,
+                method = "rf",
+                trControl = ctrl,
+                ntree = 1500,
+                tuneLength = 5,
+                metric = "ROC")
Loading required package: randomForest
randomForest 4.6-7
Type rfNews() to see new features/changes/bug fixes.
Loading required package: pROC
Loading required package: plyr
Type 'citation("pROC")' for a citation.

Attaching package: ‘pROC’

The following objects are masked from ‘package:stats’:

    cov, smooth, var

Loading required package: class
> rfFit
Random Forest 

6877 samples
 503 predictors
   2 classes: 'insurance', 'noinsurance' 

No pre-processing
Resampling: Cross-Validated (10 fold) 

Summary of sample sizes: 6190, 6190, 6188, 6189, 6189, 6190, ... 

Resampling results across tuning parameters:

  mtry  ROC    Sens    Spec   Accuracy  Kappa      ROC SD  Sens SD  Spec SD
  2     0.608  0       1      0.94      0          0.0863  0        0      
  7     0.669  0       1      0.94      -0.000285  0.0335  0        0.00049
  31    0.689  0.0146  0.993  0.934     0.0134     0.0376  0.0171   0.00373
  126   0.696  0.0292  0.986  0.928     0.0233     0.0387  0.0193   0.0042 
  502   0.688  0.0365  0.98   0.923     0.0233     0.042   0.0208   0.00392
  Accuracy SD  Kappa SD
  0.000422     0       
  0.000602     0.000901
  0.00447      0.0341  
  0.00475      0.0338  
  0.00445      0.0335  

ROC was used to select the optimal model using  the largest value.
The final value used for the model was mtry = 126. 
> 
> evalResults$RF <- predict(rfFit, evaluationInd, type = "prob")[,1]
> testResults$RF <- predict(rfFit, testingInd, type = "prob")[,1]
> rfROC <- roc(evalResults$CARAVAN, evalResults$RF,
+              levels = rev(levels(evalResults$CARAVAN)))
> rfROC

Call:
roc.default(response = evalResults$CARAVAN, predictor = evalResults$RF,     levels = rev(levels(evalResults$CARAVAN)))

Data: evalResults$RF in 924 controls (evalResults$CARAVAN noinsurance) < 59 cases (evalResults$CARAVAN insurance).
Area under the curve: 0.7596
> 
> rfEvalCM <- confusionMatrix(predict(rfFit, evaluationInd), evalResults$CARAVAN)
> rfEvalCM
Confusion Matrix and Statistics

             Reference
Prediction    insurance noinsurance
  insurance           4           9
  noinsurance        55         915
                                          
               Accuracy : 0.9349          
                 95% CI : (0.9176, 0.9495)
    No Information Rate : 0.94            
    P-Value [Acc > NIR] : 0.7727          
                                          
                  Kappa : 0.0914          
 Mcnemar's Test P-Value : 1.855e-08       
                                          
            Sensitivity : 0.067797        
            Specificity : 0.990260        
         Pos Pred Value : 0.307692        
         Neg Pred Value : 0.943299        
             Prevalence : 0.060020        
         Detection Rate : 0.004069        
   Detection Prevalence : 0.013225        
      Balanced Accuracy : 0.529028        
                                          
       'Positive' Class : insurance       
                                          
> 
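This confusion matrix is the chapter's motivating problem in miniature: with only a 6% event rate, raw accuracy is dominated by the majority class. A standalone Python sketch (not part of the book's R scripts) recomputing the relevant statistics from the printed counts:

```python
# With a ~6% event rate, accuracy alone is misleading.
tp, fn = 4, 55    # 'insurance' (positive) predicted insurance / noinsurance
fp, tn = 9, 915   # 'noinsurance' predicted insurance / noinsurance
total = tp + fp + fn + tn                 # 983 evaluation samples

accuracy = (tp + tn) / total
nir = max(tp + fn, fp + tn) / total       # no-information rate:
                                          # always predict the majority class
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
balanced = (sensitivity + specificity) / 2

# Accuracy (0.9349) is *below* the no-information rate (0.94), and the
# balanced accuracy (0.529) shows the model barely detects the events.
print(round(accuracy, 4), round(nir, 4), round(balanced, 4))
```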
> set.seed(1410)
> lrFit <- train(CARAVAN ~ .,
+                data = trainingInd[, noNZVSet],
+                method = "glm",
+                trControl = ctrl,
+                metric = "ROC")
There were 20 warnings (use warnings() to see them)
> lrFit
Generalized Linear Model 

6877 samples
 203 predictors
   2 classes: 'insurance', 'noinsurance' 

No pre-processing
Resampling: Cross-Validated (10 fold) 

Summary of sample sizes: 6190, 6190, 6188, 6189, 6189, 6190, ... 

Resampling results

  ROC    Sens    Spec   Accuracy  Kappa   ROC SD  Sens SD  Spec SD  Accuracy SD
  0.702  0.0121  0.998  0.939     0.0179  0.0488  0.0128   0.0032   0.00323    
  Kappa SD
  0.0249  

 
> 
> evalResults$LogReg <- predict(lrFit, evaluationInd[, noNZVSet], type = "prob")[,1]
Warning messages:
1: In predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type ==  :
  prediction from a rank-deficient fit may be misleading
2: In predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type ==  :
  prediction from a rank-deficient fit may be misleading
> testResults$LogReg <- predict(lrFit, testingInd[, noNZVSet], type = "prob")[,1]
Warning messages:
1: In predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type ==  :
  prediction from a rank-deficient fit may be misleading
2: In predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type ==  :
  prediction from a rank-deficient fit may be misleading
> lrROC <- roc(evalResults$CARAVAN, evalResults$LogReg,
+              levels = rev(levels(evalResults$CARAVAN)))
> lrROC

Call:
roc.default(response = evalResults$CARAVAN, predictor = evalResults$LogReg,     levels = rev(levels(evalResults$CARAVAN)))

Data: evalResults$LogReg in 924 controls (evalResults$CARAVAN noinsurance) < 59 cases (evalResults$CARAVAN insurance).
Area under the curve: 0.7267
> 
> lrEvalCM <- confusionMatrix(predict(lrFit, evaluationInd), evalResults$CARAVAN)
Warning message:
In predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type ==  :
  prediction from a rank-deficient fit may be misleading
> lrEvalCM
Confusion Matrix and Statistics

             Reference
Prediction    insurance noinsurance
  insurance           1           2
  noinsurance        58         922
                                          
               Accuracy : 0.939           
                 95% CI : (0.9221, 0.9531)
    No Information Rate : 0.94            
    P-Value [Acc > NIR] : 0.5872          
                                          
                  Kappa : 0.0266          
 Mcnemar's Test P-Value : 1.243e-12       
                                          
            Sensitivity : 0.016949        
            Specificity : 0.997835        
         Pos Pred Value : 0.333333        
         Neg Pred Value : 0.940816        
             Prevalence : 0.060020        
         Detection Rate : 0.001017        
   Detection Prevalence : 0.003052        
      Balanced Accuracy : 0.507392        
                                          
       'Positive' Class : insurance       
                                          
> 
> set.seed(1401)
> fdaFit <- train(CARAVAN ~ ., data = training,
+                 method = "fda",
+                 tuneGrid = data.frame(degree = 1, nprune = 1:25),
+                 metric = "ROC",
+                 trControl = ctrl)
Loading required package: earth
Loading required package: leaps
Loading required package: plotmo
Loading required package: plotrix
Loading required package: mda
> fdaFit
Flexible Discriminant Analysis 

6877 samples
  85 predictors
   2 classes: 'insurance', 'noinsurance' 

No pre-processing
Resampling: Cross-Validated (10 fold) 

Summary of sample sizes: 6189, 6190, 6190, 6189, 6189, 6189, ... 

Resampling results across tuning parameters:

  nprune  ROC    Sens    Spec   Accuracy  Kappa    ROC SD  Sens SD  Spec SD
  1       0.5    0       1      0.94      0        0       0        0      
  2       0.664  0       1      0.94      0        0.0291  0        0      
  3       0.691  0       0.999  0.94      -0.0011  0.0272  0        0.00149
  4       0.705  0.0146  0.997  0.938     0.0201   0.0333  0.0171   0.00231
  5       0.704  0.0146  0.997  0.938     0.0206   0.0303  0.0171   0.00251
  6       0.723  0.0244  0.997  0.938     0.0358   0.0325  0.0304   0.00204
  7       0.724  0.0268  0.995  0.937     0.035    0.0323  0.0372   0.00292
  8       0.724  0.0268  0.995  0.937     0.0347   0.0316  0.0372   0.00311
  9       0.728  0.0293  0.995  0.937     0.0383   0.0315  0.0378   0.0032 
  10      0.727  0.0317  0.994  0.936     0.0393   0.0339  0.0382   0.00482
  11      0.73   0.0366  0.993  0.936     0.0475   0.0351  0.0368   0.00484
  12      0.73   0.0415  0.992  0.936     0.0531   0.0325  0.0364   0.00452
  13      0.734  0.0488  0.993  0.936     0.0651   0.0385  0.0398   0.00411
  14      0.73   0.0488  0.992  0.935     0.0626   0.034   0.0415   0.004  
  15      0.732  0.0463  0.992  0.935     0.0599   0.0327  0.0422   0.00307
  16      0.728  0.0537  0.991  0.935     0.0707   0.0356  0.0427   0.00311
  17      0.732  0.0512  0.991  0.935     0.0647   0.0353  0.0437   0.00409
  18      0.731  0.0512  0.991  0.935     0.0648   0.0362  0.0466   0.00398
  19      0.729  0.0488  0.991  0.934     0.0597   0.0369  0.0488   0.00425
  20      0.727  0.0488  0.991  0.934     0.0599   0.0364  0.0488   0.00399
  21      0.727  0.0488  0.991  0.934     0.0599   0.0364  0.0488   0.00399
  22      0.727  0.0488  0.991  0.934     0.0599   0.0364  0.0488   0.00399
  23      0.727  0.0488  0.991  0.934     0.0599   0.0364  0.0488   0.00399
  24      0.727  0.0488  0.991  0.934     0.0599   0.0364  0.0488   0.00399
  25      0.727  0.0488  0.991  0.934     0.0599   0.0364  0.0488   0.00399
  Accuracy SD  Kappa SD
  0.000452     0       
  0.000452     0       
  0.0014       0.00265 
  0.00222      0.0286  
  0.00209      0.0281  
  0.00268      0.0509  
  0.0023       0.0541  
  0.0026       0.0544  
  0.0026       0.0553  
  0.00315      0.0509  
  0.00346      0.0495  
  0.00292      0.0481  
  0.00295      0.0557  
  0.00267      0.0575  
  0.00267      0.0614  
  0.00337      0.0637  
  0.00339      0.0624  
  0.00327      0.0652  
  0.00331      0.0679  
  0.00324      0.0685  
  0.00324      0.0685  
  0.00324      0.0685  
  0.00324      0.0685  
  0.00324      0.0685  
  0.00324      0.0685  

Tuning parameter 'degree' was held constant at a value of 1
ROC was used to select the optimal model using  the largest value.
The final values used for the model were degree = 1 and nprune = 13. 
> 
> evalResults$FDA <- predict(fdaFit, evaluation[, predictors], type = "prob")[,1]
> testResults$FDA <- predict(fdaFit, testing[, predictors], type = "prob")[,1]
> fdaROC <- roc(evalResults$CARAVAN, evalResults$FDA,
+               levels = rev(levels(evalResults$CARAVAN)))
> fdaROC

Call:
roc.default(response = evalResults$CARAVAN, predictor = evalResults$FDA,     levels = rev(levels(evalResults$CARAVAN)))

Data: evalResults$FDA in 924 controls (evalResults$CARAVAN noinsurance) < 59 cases (evalResults$CARAVAN insurance).
Area under the curve: 0.754
> 
> fdaEvalCM <- confusionMatrix(predict(fdaFit, evaluation[, predictors]), evalResults$CARAVAN)
> fdaEvalCM
Confusion Matrix and Statistics

             Reference
Prediction    insurance noinsurance
  insurance           1           3
  noinsurance        58         921
                                         
               Accuracy : 0.9379         
                 95% CI : (0.921, 0.9522)
    No Information Rate : 0.94           
    P-Value [Acc > NIR] : 0.638          
                                         
                  Kappa : 0.0243         
 Mcnemar's Test P-Value : 4.712e-12      
                                         
            Sensitivity : 0.016949       
            Specificity : 0.996753       
         Pos Pred Value : 0.250000       
         Neg Pred Value : 0.940756       
             Prevalence : 0.060020       
         Detection Rate : 0.001017       
   Detection Prevalence : 0.004069       
      Balanced Accuracy : 0.506851       
                                         
       'Positive' Class : insurance      
                                         
> 
> 
> labs <- c(RF = "Random Forest", LogReg = "Logistic Regression",
+           FDA = "FDA (MARS)")
> lift1 <- lift(CARAVAN ~ RF + LogReg + FDA, data = evalResults,
+               labels = labs)
> 
> plotTheme <- caretTheme()
> 
> plot(fdaROC, type = "S", col = plotTheme$superpose.line$col[3], legacy.axes = TRUE)

Call:
roc.default(response = evalResults$CARAVAN, predictor = evalResults$FDA,     levels = rev(levels(evalResults$CARAVAN)))

Data: evalResults$FDA in 924 controls (evalResults$CARAVAN noinsurance) < 59 cases (evalResults$CARAVAN insurance).
Area under the curve: 0.754
> plot(rfROC, type = "S", col = plotTheme$superpose.line$col[1], add = TRUE, legacy.axes = TRUE)

Call:
roc.default(response = evalResults$CARAVAN, predictor = evalResults$RF,     levels = rev(levels(evalResults$CARAVAN)))

Data: evalResults$RF in 924 controls (evalResults$CARAVAN noinsurance) < 59 cases (evalResults$CARAVAN insurance).
Area under the curve: 0.7596
> plot(lrROC, type = "S", col = plotTheme$superpose.line$col[2], add = TRUE, legacy.axes = TRUE)

Call:
roc.default(response = evalResults$CARAVAN, predictor = evalResults$LogReg,     levels = rev(levels(evalResults$CARAVAN)))

Data: evalResults$LogReg in 924 controls (evalResults$CARAVAN noinsurance) < 59 cases (evalResults$CARAVAN insurance).
Area under the curve: 0.7267
> legend(.7, .25,
+        c("Random Forest", "Logistic Regression", "FDA (MARS)"),
+        cex = .85,
+        col = plotTheme$superpose.line$col[1:3],
+        lwd = rep(2, 3),
+        lty = rep(1, 3))
> 
> xyplot(lift1,
+        ylab = "%Events Found",
+        xlab =  "%Customers Evaluated",
+        lwd = 2,
+        type = "l")
> 
> 
> ################################################################################
> ### Section 16.4 Alternate Cutoffs
> 
> rfThresh <- coords(rfROC, x = "best", ret="threshold",
+                    best.method="closest.topleft")
> rfThreshY <- coords(rfROC, x = "best", ret="threshold",
+                     best.method="youden")
> 
> cutText <- ifelse(rfThresh == rfThreshY,
+                   "is the same as",
+                   "is similar to")
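The two `coords()` calls above pick a probability cutoff by different criteria. A minimal base-R sketch with made-up sensitivity/specificity values (not the model's actual ROC coordinates) shows what "youden" and "closest.topleft" each optimize:

```r
# Candidate cutoffs with hypothetical sensitivity/specificity at each one
cut  <- c(0.10, 0.20, 0.30, 0.40)
sens <- c(0.90, 0.75, 0.55, 0.30)
spec <- c(0.40, 0.65, 0.80, 0.95)

# Youden's J statistic: maximize sensitivity + specificity - 1
youden <- cut[which.max(sens + spec - 1)]

# closest.topleft: minimize squared distance to the perfect (0, 1) ROC corner
topleft <- cut[which.min((1 - sens)^2 + (1 - spec)^2)]

youden   # 0.2 for these values: J = 0.75 + 0.65 - 1 = 0.40 is the largest
topleft  # also 0.2 here; the two criteria often (but not always) agree
```

When the two rules agree, as in this toy example, `cutText` above prints "is the same as"; otherwise the thresholds are merely close.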
> 
> evalResults$rfAlt <- factor(ifelse(evalResults$RF > rfThresh,
+                                    "insurance", "noinsurance"),
+                             levels = levels(evalResults$CARAVAN))
> testResults$rfAlt <- factor(ifelse(testResults$RF > rfThresh,
+                                    "insurance", "noinsurance"),
+                             levels = levels(testResults$CARAVAN))
> rfAltEvalCM <- confusionMatrix(evalResults$rfAlt, evalResults$CARAVAN)
> rfAltEvalCM
Confusion Matrix and Statistics

             Reference
Prediction    insurance noinsurance
  insurance          39         257
  noinsurance        20         667
                                         
               Accuracy : 0.7182         
                 95% CI : (0.689, 0.7462)
    No Information Rate : 0.94           
    P-Value [Acc > NIR] : 1              
                                         
                  Kappa : 0.1329         
 Mcnemar's Test P-Value : <2e-16         
                                         
            Sensitivity : 0.66102        
            Specificity : 0.72186        
         Pos Pred Value : 0.13176        
         Neg Pred Value : 0.97089        
             Prevalence : 0.06002        
         Detection Rate : 0.03967        
   Detection Prevalence : 0.30112        
      Balanced Accuracy : 0.69144        
                                         
       'Positive' Class : insurance      
                                         
> 
> rfAltTestCM <- confusionMatrix(testResults$rfAlt, testResults$CARAVAN)
> rfAltTestCM
Confusion Matrix and Statistics

             Reference
Prediction    insurance noinsurance
  insurance          71         467
  noinsurance        45        1379
                                         
               Accuracy : 0.739          
                 95% CI : (0.719, 0.7584)
    No Information Rate : 0.9409         
    P-Value [Acc > NIR] : 1              
                                         
                  Kappa : 0.1328         
 Mcnemar's Test P-Value : <2e-16         
                                         
            Sensitivity : 0.61207        
            Specificity : 0.74702        
         Pos Pred Value : 0.13197        
         Neg Pred Value : 0.96840        
             Prevalence : 0.05912        
         Detection Rate : 0.03619        
   Detection Prevalence : 0.27421        
      Balanced Accuracy : 0.67954        
                                         
       'Positive' Class : insurance      
                                         
> 
> rfTestCM <- confusionMatrix(predict(rfFit, testingInd), testResults$CARAVAN)
> 
> 
> plot(rfROC, print.thres = c(.5, .3, .10, rfThresh), type = "S",
+      print.thres.pattern = "%.3f (Spec = %.2f, Sens = %.2f)",
+      print.thres.cex = .8, legacy.axes = TRUE)

Call:
roc.default(response = evalResults$CARAVAN, predictor = evalResults$RF,     levels = rev(levels(evalResults$CARAVAN)))

Data: evalResults$RF in 924 controls (evalResults$CARAVAN noinsurance) < 59 cases (evalResults$CARAVAN insurance).
Area under the curve: 0.7596
> 
> ################################################################################
> ### Section 16.5 Adjusting Prior Probabilities
> 
> priors <- table(ticdata$CARAVAN)/nrow(ticdata)*100
> fdaPriors <- fdaFit
> fdaPriors$finalModel$prior <- c(insurance = .6, noinsurance =  .4)
> fdaPriorPred <- predict(fdaPriors, evaluation[,predictors])
> evalResults$FDAprior <-  predict(fdaPriors, evaluation[,predictors], type = "prob")[,1]
> testResults$FDAprior <-  predict(fdaPriors, testing[,predictors], type = "prob")[,1]
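Overwriting `fdaFit`'s prior shifts its class probabilities toward the new priors. A generic sketch of that reweighting (posterior scaled by the ratio of new to old priors, then renormalized — an illustration of the idea, not fda's internal computation):

```r
# p_new(k) is proportional to p_old(k) * prior_new(k) / prior_old(k)
reweight <- function(p, prior_old, prior_new) {
  w <- p * prior_new / prior_old
  w / sum(w)
}

p_old     <- c(insurance = 0.05, noinsurance = 0.95)  # hypothetical model posterior
prior_old <- c(insurance = 0.06, noinsurance = 0.94)  # roughly the training priors
prior_new <- c(insurance = 0.60, noinsurance = 0.40)  # the priors assigned above

p_new <- reweight(p_old, prior_old, prior_new)
round(p_new, 3)  # the rare class now dominates the posterior
```

This is why the adjusted model flags far more customers as `insurance` (higher sensitivity, lower specificity) even though the underlying discriminant function is unchanged.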
> fdaPriorCM <- confusionMatrix(fdaPriorPred, evaluation$CARAVAN)
> fdaPriorCM
Confusion Matrix and Statistics

             Reference
Prediction    insurance noinsurance
  insurance          42         306
  noinsurance        17         618
                                          
               Accuracy : 0.6714          
                 95% CI : (0.6411, 0.7007)
    No Information Rate : 0.94            
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.1156          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.71186         
            Specificity : 0.66883         
         Pos Pred Value : 0.12069         
         Neg Pred Value : 0.97323         
             Prevalence : 0.06002         
         Detection Rate : 0.04273         
   Detection Prevalence : 0.35402         
      Balanced Accuracy : 0.69035         
                                          
       'Positive' Class : insurance       
                                          
> 
> fdaPriorROC <- roc(testResults$CARAVAN, testResults$FDAprior,
+                    levels = rev(levels(testResults$CARAVAN)))
> fdaPriorROC

Call:
roc.default(response = testResults$CARAVAN, predictor = testResults$FDAprior,     levels = rev(levels(testResults$CARAVAN)))

Data: testResults$FDAprior in 1846 controls (testResults$CARAVAN noinsurance) < 116 cases (testResults$CARAVAN insurance).
Area under the curve: 0.7469
> 
> ################################################################################
> ### Section 16.7 Sampling Methods
> 
> set.seed(1237)
> downSampled <- downSample(trainingInd[, -ncol(trainingInd)], training$CARAVAN)
> 
> set.seed(1237)
> upSampled <- upSample(trainingInd[, -ncol(trainingInd)], training$CARAVAN)
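caret's `downSample` and `upSample` rebalance the classes before model fitting. A base-R sketch of what each does to the class frequencies (a conceptual illustration, not caret's implementation):

```r
set.seed(1237)
# Imbalanced toy outcome: 20 "insurance" vs. 180 "noinsurance"
y <- factor(rep(c("insurance", "noinsurance"), c(20, 180)))

# Down-sampling: randomly drop majority-class rows to match the minority count
minority <- min(table(y))
down_idx <- unlist(lapply(levels(y),
                          function(lv) sample(which(y == lv), minority)))
table(y[down_idx])  # 20 of each class

# Up-sampling: resample minority rows with replacement to match the majority count
majority <- max(table(y))
up_idx <- unlist(lapply(levels(y),
                        function(lv) sample(which(y == lv), majority, replace = TRUE)))
table(y[up_idx])    # 180 of each class
```

Down-sampling discards information from the majority class; up-sampling duplicates minority rows, which can encourage overfitting — the resampled ROC results below illustrate both effects.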
> 
> library(DMwR)
Loading required package: xts
Loading required package: zoo

Attaching package: ‘zoo’

The following objects are masked from ‘package:base’:

    as.Date, as.Date.numeric

Loading required package: quantmod
Loading required package: Defaults
Loading required package: TTR
Version 0.4-0 included new data defaults. See ?getSymbols.
Loading required package: ROCR
Loading required package: gplots
Loading required package: gtools

Attaching package: ‘gtools’

The following object is masked from ‘package:e1071’:

    permutations

Loading required package: gdata
gdata: read.xls support for 'XLS' (Excel 97-2004) files ENABLED.

gdata: read.xls support for 'XLSX' (Excel 2007+) files ENABLED.

Attaching package: ‘gdata’

The following object is masked from ‘package:randomForest’:

    combine

The following object is masked from ‘package:stats’:

    nobs

The following object is masked from ‘package:utils’:

    object.size

Loading required package: caTools
Loading required package: grid
Loading required package: KernSmooth
KernSmooth 2.23 loaded
Copyright M. P. Wand 1997-2009
Loading required package: MASS

Attaching package: ‘gplots’

The following object is masked from ‘package:plotrix’:

    plotCI

The following object is masked from ‘package:stats’:

    lowess

Loading required package: rpart
Loading required package: abind
Loading required package: cluster

Attaching package: ‘DMwR’

The following object is masked from ‘package:plyr’:

    join

Warning message:
'.path.package' is deprecated.
Use 'path.package' instead.
See help("Deprecated") 
> set.seed(1237)
> smoted <- SMOTE(CARAVAN ~ ., data = trainingInd)
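SMOTE avoids exact duplication by synthesizing new minority-class points along the segments connecting a minority case to its nearest minority neighbors. A minimal sketch of that interpolation step for one point (toy coordinates; the real algorithm also uses k nearest neighbors and can down-sample the majority class):

```r
set.seed(1237)
# Three minority-class cases in two predictor dimensions
minority_pts <- matrix(c(1,   1,
                         2,   1,
                         1.5, 2), ncol = 2, byrow = TRUE)

smote_one <- function(x, i) {
  # Euclidean distances from point i to all minority points
  d <- sqrt(rowSums((x - matrix(x[i, ], nrow(x), ncol(x), byrow = TRUE))^2))
  nn  <- order(d)[2]  # nearest neighbor other than the point itself
  gap <- runif(1)     # random position along the connecting segment
  x[i, ] + gap * (x[nn, ] - x[i, ])
}

new_pt <- smote_one(minority_pts, 1)
new_pt  # lies on the segment between point 1 and its nearest neighbor
</imports>
```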
> 
> set.seed(1410)
> rfDown <- train(Class ~ ., data = downSampled,
+                 "rf",
+                 trControl = ctrl,
+                 ntree = 1500,
+                 tuneLength = 5,
+                 metric = "ROC")
> rfDown
Random Forest 

822 samples
503 predictors
  2 classes: 'insurance', 'noinsurance' 

No pre-processing
Resampling: Cross-Validated (10 fold) 

Summary of sample sizes: 740, 740, 739, 739, 740, 740, ... 

Resampling results across tuning parameters:

  mtry  ROC    Sens   Spec   Accuracy  Kappa  ROC SD  Sens SD  Spec SD  Accuracy SD  Kappa SD
  2     0.698  0.652  0.648  0.65      0.3    0.0724  0.0921   0.12     0.0764       0.152
  7     0.682  0.608  0.677  0.642     0.285  0.0712  0.0715   0.1      0.064        0.128
  31    0.69   0.623  0.662  0.642     0.285  0.0582  0.0719   0.079    0.0513       0.103
  126   0.698  0.628  0.657  0.642     0.285  0.056   0.0655   0.0886   0.0489       0.0979
  502   0.683  0.618  0.63   0.624     0.248  0.0575  0.0516   0.0818   0.0413       0.0827

ROC was used to select the optimal model using the largest value.
The final value used for the model was mtry = 126. 
> 
> evalResults$RFdown <- predict(rfDown, evaluationInd, type = "prob")[,1]
> testResults$RFdown <- predict(rfDown, testingInd, type = "prob")[,1]
> rfDownROC <- roc(evalResults$CARAVAN, evalResults$RFdown,
+                  levels = rev(levels(evalResults$CARAVAN)))
> rfDownROC

Call:
roc.default(response = evalResults$CARAVAN, predictor = evalResults$RFdown,     levels = rev(levels(evalResults$CARAVAN)))

Data: evalResults$RFdown in 924 controls (evalResults$CARAVAN noinsurance) < 59 cases (evalResults$CARAVAN insurance).
Area under the curve: 0.7922
> 
> set.seed(1401)
> rfDownInt <- train(CARAVAN ~ ., data = trainingInd,
+                    "rf",
+                    ntree = 1500,
+                    tuneLength = 5,
+                    strata = training$CARAVAN,
+                    sampsize = rep(sum(training$CARAVAN == "insurance"), 2),
+                    metric = "ROC",
+                    trControl = ctrl)
> rfDownInt
Random Forest 

6877 samples
 503 predictors
   2 classes: 'insurance', 'noinsurance' 

No pre-processing
Resampling: Cross-Validated (10 fold) 

Summary of sample sizes: 6189, 6190, 6190, 6189, 6189, 6189, ... 

Resampling results across tuning parameters:

  mtry  ROC    Sens   Spec   Accuracy  Kappa  ROC SD  Sens SD  Spec SD  Accuracy SD  Kappa SD
  2     0.703  0.144  0.97   0.92      0.138  0.0353  0.0409   0.00587  0.00682      0.0535
  7     0.704  0.424  0.835  0.81      0.133  0.0284  0.0737   0.0204   0.0176       0.0323
  31    0.72   0.414  0.857  0.831     0.154  0.0286  0.0601   0.0188   0.0183       0.0401
  126   0.722  0.424  0.841  0.816     0.14   0.0306  0.0667   0.0171   0.0167       0.0374
  502   0.718  0.465  0.824  0.802     0.141  0.0356  0.0692   0.021    0.0201       0.0373

ROC was used to select the optimal model using the largest value.
The final value used for the model was mtry = 126. 
> 
> evalResults$RFdownInt <- predict(rfDownInt, evaluationInd, type = "prob")[,1]
> testResults$RFdownInt <- predict(rfDownInt, testingInd, type = "prob")[,1]
> rfDownIntRoc <- roc(evalResults$CARAVAN,
+                     evalResults$RFdownInt,
+                     levels = rev(levels(training$CARAVAN)))
> rfDownIntRoc

Call:
roc.default(response = evalResults$CARAVAN, predictor = evalResults$RFdownInt,     levels = rev(levels(training$CARAVAN)))

Data: evalResults$RFdownInt in 924 controls (evalResults$CARAVAN noinsurance) < 59 cases (evalResults$CARAVAN insurance).
Area under the curve: 0.7962
> 
> set.seed(1410)
> rfUp <- train(Class ~ ., data = upSampled,
+               "rf",
+               trControl = ctrl,
+               ntree = 1500,
+               tuneLength = 5,
+               metric = "ROC")
> rfUp
Random Forest 

12932 samples
  503 predictors
    2 classes: 'insurance', 'noinsurance' 

No pre-processing
Resampling: Cross-Validated (10 fold) 

Summary of sample sizes: 11640, 11638, 11639, 11638, 11638, 11640, ... 

Resampling results across tuning parameters:

  mtry  ROC    Sens   Spec   Accuracy  Kappa  ROC SD   Sens SD  Spec SD  Accuracy SD  Kappa SD
  2     0.865  0.836  0.731  0.783     0.567  0.0115   0.00971  0.0186   0.0112       0.0224
  7     0.987  0.992  0.861  0.927     0.853  0.00354  0.00375  0.0226   0.01         0.02
  31    0.993  0.999  0.938  0.968     0.937  0.00309  0.00167  0.0127   0.00668      0.0134
  126   0.992  1      0.95   0.975     0.95   0.00345  0        0.0103   0.00515      0.0103
  502   0.992  1      0.943  0.971     0.943  0.00379  0        0.0136   0.00681      0.0136

ROC was used to select the optimal model using the largest value.
The final value used for the model was mtry = 31. 
> 
> evalResults$RFup <- predict(rfUp, evaluationInd, type = "prob")[,1]
> testResults$RFup <- predict(rfUp, testingInd, type = "prob")[,1]
> rfUpROC <- roc(evalResults$CARAVAN, evalResults$RFup,
+                levels = rev(levels(evalResults$CARAVAN)))
> rfUpROC

Call:
roc.default(response = evalResults$CARAVAN, predictor = evalResults$RFup,     levels = rev(levels(evalResults$CARAVAN)))

Data: evalResults$RFup in 924 controls (evalResults$CARAVAN noinsurance) < 59 cases (evalResults$CARAVAN insurance).
Area under the curve: 0.7336
> 
> set.seed(1410)
> rfSmote <- train(CARAVAN ~ ., data = smoted,
+                  "rf",
+                  trControl = ctrl,
+                  ntree = 1500,
+                  tuneLength = 5,
+                  metric = "ROC")
> rfSmote
Random Forest 

2877 samples
 503 predictors
   2 classes: 'insurance', 'noinsurance' 

No pre-processing
Resampling: Cross-Validated (10 fold) 

Summary of sample sizes: 2590, 2589, 2589, 2590, 2588, 2590, ... 

Resampling results across tuning parameters:

  mtry  ROC    Sens   Spec   Accuracy  Kappa  ROC SD  Sens SD  Spec SD  Accuracy SD  Kappa SD
  2     0.906  0.666  0.998  0.856     0.693  0.0215  0.0322   0.00409  0.0142       0.0314
  7     0.908  0.69   0.973  0.852     0.687  0.0177  0.0299   0.0241   0.0215       0.0451
  31    0.914  0.731  0.947  0.854     0.695  0.0168  0.0243   0.0223   0.0183       0.0378
  126   0.918  0.736  0.942  0.853     0.693  0.0146  0.0231   0.0208   0.0154       0.0319
  502   0.912  0.742  0.923  0.845     0.678  0.0151  0.0201   0.0306   0.0211       0.0428

ROC was used to select the optimal model using the largest value.
The final value used for the model was mtry = 126. 
> 
> evalResults$RFsmote <- predict(rfSmote, evaluationInd, type = "prob")[,1]
> testResults$RFsmote <- predict(rfSmote, testingInd, type = "prob")[,1]
> rfSmoteROC <- roc(evalResults$CARAVAN, evalResults$RFsmote,
+                   levels = rev(levels(evalResults$CARAVAN)))
> rfSmoteROC

Call:
roc.default(response = evalResults$CARAVAN, predictor = evalResults$RFsmote,     levels = rev(levels(evalResults$CARAVAN)))

Data: evalResults$RFsmote in 924 controls (evalResults$CARAVAN noinsurance) < 59 cases (evalResults$CARAVAN insurance).
Area under the curve: 0.7675
> 
> rfSmoteCM <- confusionMatrix(predict(rfSmote, evaluationInd), evalResults$CARAVAN)
> rfSmoteCM
Confusion Matrix and Statistics

             Reference
Prediction    insurance noinsurance
  insurance          11          50
  noinsurance        48         874
                                          
               Accuracy : 0.9003          
                 95% CI : (0.8799, 0.9183)
    No Information Rate : 0.94            
    P-Value [Acc > NIR] : 1.0000          
                                          
                  Kappa : 0.1303          
 Mcnemar's Test P-Value : 0.9195          
                                          
            Sensitivity : 0.18644         
            Specificity : 0.94589         
         Pos Pred Value : 0.18033         
         Neg Pred Value : 0.94794         
             Prevalence : 0.06002         
         Detection Rate : 0.01119         
   Detection Prevalence : 0.06205         
      Balanced Accuracy : 0.56616         
                                          
       'Positive' Class : insurance       
                                          
> 
> samplingSummary <- function(x, evl, tst)
+   {
+     lvl <- rev(levels(tst$CARAVAN))
+     evlROC <- roc(evl$CARAVAN,
+                   predict(x, evl, type = "prob")[,1],
+                   levels = lvl)
+     rocs <- c(auc(evlROC),
+               auc(roc(tst$CARAVAN,
+                       predict(x, tst, type = "prob")[,1],
+                       levels = lvl)))
+     cut <- coords(evlROC, x = "best", ret="threshold",
+                   best.method="closest.topleft")
+     bestVals <- coords(evlROC, cut, ret=c("sensitivity", "specificity"))
+     out <- c(rocs, bestVals*100)
+     names(out) <- c("evROC", "tsROC", "tsSens", "tsSpec")
+     out
+ 
+   }
> 
> rfResults <- rbind(samplingSummary(rfFit, evaluationInd, testingInd),
+                    samplingSummary(rfDown, evaluationInd, testingInd),
+                    samplingSummary(rfDownInt, evaluationInd, testingInd),
+                    samplingSummary(rfUp, evaluationInd, testingInd),
+                    samplingSummary(rfSmote, evaluationInd, testingInd))
> rownames(rfResults) <- c("Original", "Down--Sampling",  "Down--Sampling (Internal)",
+                          "Up--Sampling", "SMOTE")
> 
> rfResults
                              evROC     tsROC   tsSens   tsSpec
Original                  0.7596119 0.7360673 66.10169 72.18615
Down--Sampling            0.7921894 0.7291301 86.44068 67.74892
Down--Sampling (Internal) 0.7961516 0.7649158 66.10169 80.30303
Up--Sampling              0.7336195 0.7408283 72.88136 63.96104
SMOTE                     0.7675178 0.7318643 81.35593 65.36797
> 
> rocCols <- c("black", rgb(1, 0, 0, .5), rgb(0, 0, 1, .5))
> 
> plot(roc(testResults$CARAVAN, testResults$RF, levels = rev(levels(testResults$CARAVAN))),
+      type = "S", col = rocCols[1], legacy.axes = TRUE)

Call:
roc.default(response = testResults$CARAVAN, predictor = testResults$RF,     levels = rev(levels(testResults$CARAVAN)))

Data: testResults$RF in 1846 controls (testResults$CARAVAN noinsurance) < 116 cases (testResults$CARAVAN insurance).
Area under the curve: 0.7361
> plot(roc(testResults$CARAVAN, testResults$RFdownInt, levels = rev(levels(testResults$CARAVAN))),
+      type = "S", col = rocCols[2],add = TRUE, legacy.axes = TRUE)

Call:
roc.default(response = testResults$CARAVAN, predictor = testResults$RFdownInt,     levels = rev(levels(testResults$CARAVAN)))

Data: testResults$RFdownInt in 1846 controls (testResults$CARAVAN noinsurance) < 116 cases (testResults$CARAVAN insurance).
Area under the curve: 0.7649
> plot(roc(testResults$CARAVAN, testResults$RFsmote, levels = rev(levels(testResults$CARAVAN))),
+      type = "S", col = rocCols[3], add = TRUE, legacy.axes = TRUE)

Call:
roc.default(response = testResults$CARAVAN, predictor = testResults$RFsmote,     levels = rev(levels(testResults$CARAVAN)))

Data: testResults$RFsmote in 1846 controls (testResults$CARAVAN noinsurance) < 116 cases (testResults$CARAVAN insurance).
Area under the curve: 0.7319
> legend(.6, .4,
+        c("Normal", "Down-Sampling (Internal)", "SMOTE"),
+        lty = rep(1, 3),
+        lwd = rep(2, 3),
+        cex = .8,
+        col = rocCols)
> 
> xyplot(lift(CARAVAN ~ RF + RFdownInt + RFsmote,
+             data = testResults),
+        type = "l",
+        ylab = "%Events Found",
+        xlab =  "%Customers Evaluated")
> 
> 
> ################################################################################
> ### Section 16.8 Cost-Sensitive Training
> 
> library(kernlab)
> 
> set.seed(1157)
> sigma <- sigest(CARAVAN ~ ., data = trainingInd[, noNZVSet], frac = .75)
> names(sigma) <- NULL
> 
> svmGrid1 <- data.frame(sigma = sigma[2],
+                        C = 2^c(2:10))
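`sigest` estimates a plausible range for the RBF kernel's sigma from the spread of pairwise squared distances in the predictors; `sigma[2]` above is its middle estimate. A base-R sketch of the general "scale sigma by a typical squared distance" idea (kernlab's exact quantile rule is treated as an assumption here, and this is only an approximation of it):

```r
set.seed(1157)
x <- matrix(rnorm(100 * 5), ncol = 5)  # toy predictor matrix

# Squared Euclidean distances between randomly sampled row pairs
i  <- sample(nrow(x), 50, replace = TRUE)
j  <- sample(nrow(x), 50, replace = TRUE)
d2 <- rowSums((x[i, ] - x[j, ])^2)

# Median heuristic: sigma on the order of 1 / (typical squared distance)
sigma_est <- 1 / median(d2[d2 > 0])
sigma_est
```

Fixing sigma this way leaves only the cost parameter C to tune, which is why `svmGrid1` varies C alone.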
> 
> set.seed(1401)
> svmFit <- train(CARAVAN ~ .,
+                 data = trainingInd[, noNZVSet],
+                 method = "svmRadial",
+                 tuneGrid = svmGrid1,
+                 preProc = c("center", "scale"),
+                 metric = "Kappa",
+                 trControl = ctrl)
> svmFit
Support Vector Machines with Radial Basis Function Kernel 

6877 samples
 203 predictors
   2 classes: 'insurance', 'noinsurance' 

Pre-processing: centered, scaled 
Resampling: Cross-Validated (10 fold) 

Summary of sample sizes: 6189, 6190, 6190, 6189, 6189, 6189, ... 

Resampling results across tuning parameters:

  C     ROC    Sens  Spec  Accuracy  Kappa      ROC SD  Sens SD  Spec SD   Accuracy SD  Kappa SD
  4     0.665  0     1     0.94      -0.000285  0.0465  0        0.000489  6e-04        9e-04
  8     0.671  0     1     0.94      0          0.0476  0        0         0.000452     0
  16    0.678  0     1     0.94      0          0.041   0        0         0.000452     0
  32    0.678  0     1     0.94      0          0.0368  0        0         0.000452     0
  64    0.668  0     1     0.94      0          0.0399  0        0         0.000452     0
  128   0.655  0     1     0.94      0          0.039   0        0         0.000452     0
  256   0.648  0     1     0.94      0          0.0395  0        0         0.000452     0
  512   0.644  0     1     0.94      0          0.0401  0        0         0.000452     0
  1020  0.643  0     1     0.94      0          0.037   0        0         0.000452     0

Tuning parameter 'sigma' was held constant at a value of 0.002454182
Kappa was used to select the optimal model using the largest value.
The final values used for the model were sigma = 0.00245 and C = 8. 
> 
> evalResults$SVM <- predict(svmFit, evaluationInd[, noNZVSet], type = "prob")[,1]
> testResults$SVM <- predict(svmFit, testingInd[, noNZVSet], type = "prob")[,1]
> svmROC <- roc(evalResults$CARAVAN, evalResults$SVM,
+               levels = rev(levels(evalResults$CARAVAN)))
> svmROC

Call:
roc.default(response = evalResults$CARAVAN, predictor = evalResults$SVM,     levels = rev(levels(evalResults$CARAVAN)))

Data: evalResults$SVM in 924 controls (evalResults$CARAVAN noinsurance) < 59 cases (evalResults$CARAVAN insurance).
Area under the curve: 0.6952
> 
> svmTestROC <- roc(testResults$CARAVAN, testResults$SVM,
+                   levels = rev(levels(testResults$CARAVAN)))
> svmTestROC

Call:
roc.default(response = testResults$CARAVAN, predictor = testResults$SVM,     levels = rev(levels(testResults$CARAVAN)))

Data: testResults$SVM in 1846 controls (testResults$CARAVAN noinsurance) < 116 cases (testResults$CARAVAN insurance).
Area under the curve: 0.6974
> 
> confusionMatrix(predict(svmFit, evaluationInd[, noNZVSet]), evalResults$CARAVAN)
Confusion Matrix and Statistics

             Reference
Prediction    insurance noinsurance
  insurance           0           0
  noinsurance        59         924
                                         
               Accuracy : 0.94           
                 95% CI : (0.9233, 0.954)
    No Information Rate : 0.94           
    P-Value [Acc > NIR] : 0.5346         
                                         
                  Kappa : 0              
 Mcnemar's Test P-Value : 4.321e-14      
                                         
            Sensitivity : 0.00000        
            Specificity : 1.00000        
         Pos Pred Value :     NaN        
         Neg Pred Value : 0.93998        
             Prevalence : 0.06002        
         Detection Rate : 0.00000        
   Detection Prevalence : 0.00000        
      Balanced Accuracy : 0.50000        
                                         
       'Positive' Class : insurance      
                                         
> 
> confusionMatrix(predict(svmFit, testingInd[, noNZVSet]), testingInd$CARAVAN)
Confusion Matrix and Statistics

             Reference
Prediction    insurance noinsurance
  insurance           0           0
  noinsurance       116        1846
                                          
               Accuracy : 0.9409          
                 95% CI : (0.9295, 0.9509)
    No Information Rate : 0.9409          
    P-Value [Acc > NIR] : 0.5247          
                                          
                  Kappa : 0               
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.00000         
            Specificity : 1.00000         
         Pos Pred Value :     NaN         
         Neg Pred Value : 0.94088         
             Prevalence : 0.05912         
         Detection Rate : 0.00000         
   Detection Prevalence : 0.00000         
      Balanced Accuracy : 0.50000         
                                          
       'Positive' Class : insurance       
                                          
> 
> 
> set.seed(1401)
> svmWtFit <- train(CARAVAN ~ .,
+                   data = trainingInd[, noNZVSet],
+                   method = "svmRadial",
+                   tuneGrid = svmGrid1,
+                   preProc = c("center", "scale"),
+                   metric = "Kappa",
+                   class.weights = c(insurance = 18, noinsurance = 1),
+                   trControl = ctrlNoProb)
> svmWtFit
Support Vector Machines with Radial Basis Function Kernel 

6877 samples
 203 predictors
   2 classes: 'insurance', 'noinsurance' 

Pre-processing: centered, scaled 
Resampling: Cross-Validated (10 fold) 

Summary of sample sizes: 6189, 6190, 6190, 6189, 6189, 6189, ... 

Resampling results across tuning parameters:

  C     Accuracy  Kappa   Sens   Spec   Accuracy SD  Kappa SD  Sens SD  Spec SD
  4     0.818     0.105   0.343  0.849  0.016        0.0399    0.0853   0.0186 
  8     0.842     0.116   0.309  0.876  0.0142       0.0339    0.0605   0.0159 
  16    0.855     0.105   0.256  0.893  0.0192       0.0442    0.0602   0.0207 
  32    0.869     0.11    0.234  0.909  0.0159       0.0507    0.0633   0.0167 
  64    0.876     0.0948  0.195  0.919  0.0173       0.0426    0.0435   0.0179 
  128   0.879     0.0865  0.175  0.924  0.0155       0.049     0.0503   0.0155 
  256   0.88      0.0843  0.17   0.925  0.0154       0.0419    0.0386   0.0154 
  512   0.879     0.0739  0.161  0.925  0.015        0.0501    0.0557   0.0157 
  1020  0.88      0.073   0.158  0.925  0.0148       0.0511    0.0569   0.0154 

Tuning parameter 'sigma' was held constant at a value of 0.002454182
Kappa was used to select the optimal model using the largest value.
The final values used for the model were sigma = 0.00245 and C = 8. 
> 
> svmWtEvalCM <- confusionMatrix(predict(svmWtFit, evaluationInd[, noNZVSet]), evalResults$CARAVAN)
> svmWtEvalCM
Confusion Matrix and Statistics

             Reference
Prediction    insurance noinsurance
  insurance          17         123
  noinsurance        42         801
                                         
               Accuracy : 0.8321         
                 95% CI : (0.8073, 0.855)
    No Information Rate : 0.94           
    P-Value [Acc > NIR] : 1              
                                         
                  Kappa : 0.0944         
 Mcnemar's Test P-Value : 4.725e-10      
                                         
            Sensitivity : 0.28814        
            Specificity : 0.86688        
         Pos Pred Value : 0.12143        
         Neg Pred Value : 0.95018        
             Prevalence : 0.06002        
         Detection Rate : 0.01729        
   Detection Prevalence : 0.14242        
      Balanced Accuracy : 0.57751        
                                         
       'Positive' Class : insurance      
                                         
> 
> svmWtTestCM <- confusionMatrix(predict(svmWtFit, testingInd[, noNZVSet]), testingInd$CARAVAN)
> svmWtTestCM
Confusion Matrix and Statistics

             Reference
Prediction    insurance noinsurance
  insurance          40         223
  noinsurance        76        1623
                                          
               Accuracy : 0.8476          
                 95% CI : (0.8309, 0.8632)
    No Information Rate : 0.9409          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.1406          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.34483         
            Specificity : 0.87920         
         Pos Pred Value : 0.15209         
         Neg Pred Value : 0.95527         
             Prevalence : 0.05912         
         Detection Rate : 0.02039         
   Detection Prevalence : 0.13405         
      Balanced Accuracy : 0.61201         
                                          
       'Positive' Class : insurance       
                                          
> 
> 
> initialRpart <- rpart(CARAVAN ~ ., data = training,
+                       control = rpart.control(cp = 0.0001))
> rpartGrid <- data.frame(cp = initialRpart$cptable[, "CP"])
> 
> cmat <- list(loss = matrix(c(0, 1, 20, 0), ncol = 2))
> set.seed(1401)
> cartWMod <- train(x = training[,predictors],
+                   y = training$CARAVAN,
+                   method = "rpart",
+                   trControl = ctrlNoProb,
+                   tuneGrid = rpartGrid,
+                   metric = "Kappa",
+                   parms = cmat)
> cartWMod
CART 

6877 samples
  85 predictors
   2 classes: 'insurance', 'noinsurance' 

No pre-processing
Resampling: Cross-Validated (10 fold) 

Summary of sample sizes: 6189, 6190, 6190, 6189, 6189, 6189, ... 

Resampling results across tuning parameters:

  cp        Accuracy  Kappa   Sens   Spec   Accuracy SD  Kappa SD  Sens SD  Spec SD
  1e-04     0.797     0.0734  0.316  0.828  0.018        0.0435    0.0918   0.0203
  0.000487  0.797     0.0744  0.319  0.827  0.0189       0.0423    0.0892   0.0212
  0.00122   0.778     0.0768  0.36   0.805  0.02         0.037     0.0883   0.0229
  0.00162   0.762     0.0844  0.411  0.785  0.0181       0.0298    0.0794   0.0208
  0.00243   0.722     0.0805  0.48   0.737  0.024        0.0253    0.0786   0.0274
  0.00278   0.707     0.0773  0.499  0.72   0.0229       0.0299    0.0916   0.0256

Kappa was used to select the optimal model using  the largest value.
The final value used for the model was cp = 0.00162. 
> 
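For reference, rpart reads the `parms = list(loss = ...)` matrix with rows as the observed class and columns as the predicted class, in factor-level order. A minimal sketch of the 2 x 2 matrix passed above (level names taken from the CARAVAN factor used throughout; treat the orientation note as an assumption to verify against `?rpart`):

```r
## rpart loss matrix: rows = observed class, columns = predicted class
## (factor-level order: 'insurance', then 'noinsurance').
lossMat <- matrix(c(0, 1, 20, 0), ncol = 2)
rownames(lossMat) <- colnames(lossMat) <- c("insurance", "noinsurance")

## Misclassifying an actual buyer as a non-buyer is 20 times worse
## than the reverse error:
lossMat["insurance", "noinsurance"]  # 20
lossMat["noinsurance", "insurance"]  # 1
```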
> 
> library(C50)
> c5Grid <- expand.grid(model = c("tree", "rules"),
+                       trials = c(1, (1:10)*10),
+                       winnow = FALSE)
> 
> finalCost <- matrix(c(0, 20, 1, 0), ncol = 2)
> rownames(finalCost) <- colnames(finalCost) <- levels(training$CARAVAN)
> set.seed(1401)
> C5CostFit <- train(training[, predictors],
+                    training$CARAVAN,
+                    method = "C5.0",
+                    metric = "Kappa",
+                    tuneGrid = c5Grid,
+                    cost = finalCost,
+                    control = C5.0Control(earlyStopping = FALSE),
+                    trControl = ctrlNoProb)
> 
> C5CostCM <- confusionMatrix(predict(C5CostFit, testing), testing$CARAVAN)
> C5CostCM
Confusion Matrix and Statistics

             Reference
Prediction    insurance noinsurance
  insurance          64         623
  noinsurance        52        1223
                                         
               Accuracy : 0.656          
                 95% CI : (0.6345, 0.677)
    No Information Rate : 0.9409         
    P-Value [Acc > NIR] : 1              
                                         
                  Kappa : 0.0648         
 Mcnemar's Test P-Value : <2e-16         
                                         
            Sensitivity : 0.55172        
            Specificity : 0.66251        
         Pos Pred Value : 0.09316        
         Neg Pred Value : 0.95922        
             Prevalence : 0.05912        
         Detection Rate : 0.03262        
   Detection Prevalence : 0.35015        
      Balanced Accuracy : 0.60712        
                                         
       'Positive' Class : insurance      
                                         
> 
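The cost trade-off is visible directly in the confusion matrix above. As a rough back-of-the-envelope check, using counts copied from the `C5CostCM` printout and assuming the 20:1 penalty in `finalCost` applies to missed buyers:

```r
## Counts copied from the C5CostCM confusion matrix above.
fp <- 623  # predicted 'insurance', observed 'noinsurance' (cost 1)
fn <- 52   # predicted 'noinsurance', observed 'insurance' (cost 20, assumed)
n  <- 64 + 623 + 52 + 1223

(fp * 1 + fn * 20) / n  # average cost per test-set sample, roughly 0.85
```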
> 
> ################################################################################
> ### Session Information
> 
> sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] C50_0.1.0-14       kernlab_0.9-16     DMwR_0.3.0         cluster_1.14.4    
 [5] abind_1.4-0        rpart_4.1-1        ROCR_1.0-4         gplots_2.11.0     
 [9] MASS_7.3-26        KernSmooth_2.23-10 caTools_1.14       gdata_2.12.0      
[13] gtools_2.7.0       quantmod_0.4-0     TTR_0.21-1         Defaults_1.1-1    
[17] xts_0.9-3          zoo_1.7-9          mda_0.4-2          earth_3.2-3       
[21] plotrix_3.4-6      plotmo_1.3-2       leaps_2.9          e1071_1.6-1       
[25] class_7.3-7        pROC_1.5.4         plyr_1.8           randomForest_4.6-7
[29] caret_6.0-22       ggplot2_0.9.3.1    DWD_0.10           Matrix_1.0-12     
[33] lattice_0.20-15   

loaded via a namespace (and not attached):
 [1] bitops_1.0-5       car_2.0-16         codetools_0.2-8    colorspace_1.2-1  
 [5] compiler_3.0.1     dichromat_2.0-0    digest_0.6.3       foreach_1.4.0     
 [9] gtable_0.1.2       iterators_1.0.6    labeling_0.1       munsell_0.4       
[13] proto_0.3-10       RColorBrewer_1.0-5 reshape2_1.2.2     scales_0.2.3      
[17] stringr_0.6.2      tools_3.0.1       
> 
> q("no")
> proc.time()
      user     system    elapsed 
243437.520    682.066 244138.032 
In [80]:
%%R -w 600 -h 600

## runChapterScript(16)

##        user     system    elapsed 
##  243437.520    682.066 244138.032
NULL
In [89]:
%%R

showChapterScript(17)
NULL
In [82]:
%%R

showChapterOutput(17)
R Information
R version 3.0.1 (2013-05-16) -- "Good Sport"
Copyright (C) 2013 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin10.8.0 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> ################################################################################
> ### R code from Applied Predictive Modeling (2013) by Kuhn and Johnson.
> ### Copyright 2013 Kuhn and Johnson
> ### Web Page: http://www.appliedpredictivemodeling.com
> ### Contact: Max Kuhn (mxkuhn@gmail.com) 
> ###
> ### Chapter 17: Case Study: Job Scheduling
> ###
> ### Required packages: AppliedPredictiveModeling, C50, caret, doMC (optional),
> ###                    earth, Hmisc, ipred, tabplot, kernlab, lattice, MASS,
> ###                    mda, nnet, pls, randomForest, rpart, sparseLDA, 
> ###
> ### Data used: The HPC job scheduling data in the AppliedPredictiveModeling
> ###            package.
> ###
> ### Notes: 
> ### 1) This code is provided without warranty.
> ###
> ### 2) This code should help the user reproduce the results in the
> ### text. There will be differences between this code and what is in
> ### the computing section. For example, the computing sections show
> ### how the source functions work (e.g. randomForest() or plsr()),
> ### which were not directly used when creating the book. Also, there may be 
> ### syntax differences that occur over time as packages evolve. These files 
> ### will reflect those changes.
> ###
> ### 3) In some cases, the calculations in the book were run in 
> ### parallel. The sub-processes may reset the random number seed.
> ### Your results may slightly vary.
> ###
> ################################################################################
> 
> library(AppliedPredictiveModeling)
> data(schedulingData)
> 
> ### Make a vector of predictor names
> predictors <- names(schedulingData)[!(names(schedulingData) %in% c("Class"))]
> 
> ### A few summaries and plots of the data
> library(Hmisc)
Loading required package: survival
Loading required package: splines
Hmisc library by Frank E Harrell Jr

Type library(help='Hmisc'), ?Overview, or ?Hmisc.Overview')
to see overall documentation.

NOTE:Hmisc no longer redefines [.factor to drop unused levels when
subsetting.  To get the old behavior of Hmisc type dropUnusedLevels().


Attaching package: ‘Hmisc’

The following object is masked from ‘package:survival’:

    untangle.specials

The following object is masked from ‘package:base’:

    format.pval, round.POSIXt, trunc.POSIXt, units

> describe(schedulingData)
schedulingData 

 8  Variables      4331  Observations
--------------------------------------------------------------------------------
Protocol 
      n missing  unique 
   4331       0      14 

           A   C   D  E   F   G   H   I   J K   L   M   N   O
Frequency 94 160 149 96 170 155 321 381 989 6 242 451 536 581
%          2   4   3  2   4   4   7   9  23 0   6  10  12  13
--------------------------------------------------------------------------------
Compounds 
      n missing  unique    Mean     .05     .10     .25     .50     .75     .90 
   4331       0     858   497.7      27      37      98     226     448     967 
    .95 
   2512 

lowest :    20    21    22    23    24, highest: 14087 14090 14091 14097 14103 
--------------------------------------------------------------------------------
InputFields 
      n missing  unique    Mean     .05     .10     .25     .50     .75     .90 
   4331       0    1730    1537      26      48     134     426     991    4165 
    .95 
   7594 

lowest :    10    11    12    13    14, highest: 36021 45420 45628 55920 56671 
--------------------------------------------------------------------------------
Iterations 
      n missing  unique    Mean     .05     .10     .25     .50     .75     .90 
   4331       0      11   29.24      10      20      20      20      20      50 
    .95 
    100 

           10 11 15   20 30 40  50 100 125 150 200
Frequency 272  9  2 3568  3  7 153 188   1   2 126
%           6  0  0   82  0  0   4   4   0   0   3
--------------------------------------------------------------------------------
NumPending 
      n missing  unique    Mean     .05     .10     .25     .50     .75     .90 
   4331       0     303   53.39     0.0     0.0     0.0     0.0     0.0    33.0 
    .95 
  145.5 

lowest :    0    1    2    3    4, highest: 3822 3870 3878 5547 5605 
--------------------------------------------------------------------------------
Hour 
      n missing  unique    Mean     .05     .10     .25     .50     .75     .90 
   4331       0     924   13.73   7.025   9.333  10.900  14.017  16.600  18.250 
    .95 
 19.658 

lowest :  0.01667  0.03333  0.08333  0.10000  0.11667
highest: 23.20000 23.21667 23.35000 23.80000 23.98333 
--------------------------------------------------------------------------------
Day 
      n missing  unique 
   4331       0       7 

          Mon Tue Wed Thu Fri Sat Sun
Frequency 692 900 903 720 923  32 161
%          16  21  21  17  21   1   4
--------------------------------------------------------------------------------
Class 
      n missing  unique 
   4331       0       4 

VF (2211, 51%), F (1347, 31%), M (514, 12%), L (259, 6%) 
--------------------------------------------------------------------------------
> 
> library(tabplot)
Loading required package: ffbase
Loading required package: ff
Loading required package: tools
Loading required package: bit
Attaching package bit
package:bit (c) 2008-2012 Jens Oehlschlaegel (GPL-2)
creators: bit bitwhich
coercion: as.logical as.integer as.bit as.bitwhich which
operator: ! & | xor != ==
querying: print length any all min max range sum summary
bit access: length<- [ [<- [[ [[<-
for more help type ?bit

Attaching package: ‘bit’

The following object is masked from ‘package:base’:

    xor

Attaching package ff
- getOption("fftempdir")=="/var/folders/Zf/ZfjbGEqKH2GPlbqofbYnBU+++TI/-Tmp-//RtmpZwCCTR"

- getOption("ffextension")=="ff"

- getOption("ffdrop")==TRUE

- getOption("fffinonexit")==TRUE

- getOption("ffpagesize")==65536

- getOption("ffcaching")=="mmnoflush"  -- consider "ffeachflush" if your system stalls on large writes

- getOption("ffbatchbytes")==16777216 -- consider a different value for tuning your system

- getOption("ffmaxbytes")==536870912 -- consider a different value for tuning your system


Attaching package: ‘ff’

The following object is masked from ‘package:utils’:

    write.csv, write.csv2

The following object is masked from ‘package:base’:

    is.factor, is.ordered


Attaching package: ‘ffbase’

The following object is masked from ‘package:base’:

    %in%

Loading required package: grid
> tableplot(schedulingData[, c( "Class", predictors)])
> 
> mosaicplot(table(schedulingData$Protocol, 
+                  schedulingData$Class), 
+            main = "")
> 
> library(lattice)
> xyplot(Compounds ~ InputFields|Protocol,
+        data = schedulingData,
+        scales = list(x = list(log = 10), y = list(log = 10)),
+        groups = Class,
+        xlab = "Input Fields",
+        auto.key = list(columns = 4),
+        aspect = 1,
+        as.table = TRUE)
> 
> 
> ################################################################################
> ### Section 17.1 Data Splitting and Model Strategy
> 
> ## Split the data
> 
> library(caret)
Loading required package: ggplot2

Attaching package: ‘caret’

The following object is masked from ‘package:survival’:

    cluster

> set.seed(1104)
> inTrain <- createDataPartition(schedulingData$Class, p = .8, list = FALSE)
> 
> ### There are a lot of zeros and the distribution is skewed. We add
> ### one so that we can log transform the data
> schedulingData$NumPending <- schedulingData$NumPending + 1
> 
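The `+ 1` shift above matters because `log10(0)` is `-Inf` in R, and many jobs have zero pending jobs, which would otherwise poison the `log10(NumPending)` term in the model formulas:

```r
## Zero counts are undefined on the log scale; shifting by one maps
## "no pending jobs" to zero instead.
log10(0)      # -Inf
log10(0 + 1)  # 0
```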
> trainData <- schedulingData[ inTrain,]
> testData  <- schedulingData[-inTrain,]
> 
> ### Create a main effects only model formula to use
> ### repeatedly. Another formula with nonlinear effects is created
> ### below.
> modForm <- as.formula(Class ~ Protocol + log10(Compounds) +
+   log10(InputFields)+ log10(Iterations) +
+   log10(NumPending) + Hour + Day)
> 
> ### Create an expanded set of predictors with interactions. 
> 
> modForm2 <- as.formula(Class ~ (Protocol + log10(Compounds) +
+   log10(InputFields)+ log10(Iterations) +
+   log10(NumPending) + Hour + Day)^2)
> 
> 
> ### Some of these terms will not be estimable. For example, if there
> ### are no data points where a particular protocol was run on a
> ### particular day, the full interaction cannot be computed. We use
> ### model.matrix() to create the whole set of predictor columns, then
> ### remove those that are zero variance
> 
> expandedTrain <- model.matrix(modForm2, data = trainData)
> expandedTest  <- model.matrix(modForm2, data = testData)
> expandedTrain <- as.data.frame(expandedTrain)
> expandedTest  <-  as.data.frame(expandedTest)
> 
> ### Some models have issues when there is a zero variance predictor
> ### within the data of a particular class, so we used caret's
> ### checkConditionalX() function to find the offending columns and
> ### remove them
> 
> zv <- checkConditionalX(expandedTrain, trainData$Class)
> 
> ### Keep the expanded set to use for models where we must manually add
> ### more complex terms (such as logistic regression)
> 
> expandedTrain <-  expandedTrain[,-zv]
> expandedTest  <-  expandedTest[, -zv]
> 
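To see why the full interaction expansion can produce inestimable columns, here is a toy illustration (the factor values are made up, not from the scheduling data): if no `B`-protocol job ever ran on `Sun`, the `B:Sun` interaction indicator is identically zero and carries no information.

```r
## Toy data with an empty Protocol-by-Day cell.
toy <- data.frame(Protocol = factor(c("A", "A", "B", "B")),
                  Day      = factor(c("Mon", "Sun", "Mon", "Mon")))
mm <- model.matrix(~ (Protocol + Day)^2, data = toy)

colnames(mm)
## "(Intercept)" "ProtocolB" "DaySun" "ProtocolB:DaySun"
sum(mm[, "ProtocolB:DaySun"])  # 0 -- a zero-variance column
```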
> ### Create the cost matrix
> costMatrix <- ifelse(diag(4) == 1, 0, 1)
> costMatrix[4, 1] <- 10
> costMatrix[3, 1] <- 5
> costMatrix[4, 2] <- 5
> costMatrix[3, 2] <- 5
> rownames(costMatrix) <- colnames(costMatrix) <- levels(trainData$Class)
> 
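Written out in full, the cost matrix built above (read here with rows as observed classes and columns as predicted, matching rpart's loss-matrix convention) penalizes severe underestimates of run time most heavily:

```r
## Reconstruction of the job-scheduling cost matrix from the code above.
## Off-diagonal errors cost 1 unless overridden.
costMatrix <- ifelse(diag(4) == 1, 0, 1)
costMatrix[4, 1] <- 10  # observed 'L', predicted 'VF'
costMatrix[3, 1] <- 5   # observed 'M', predicted 'VF'
costMatrix[4, 2] <- 5   # observed 'L', predicted 'F'
costMatrix[3, 2] <- 5   # observed 'M', predicted 'F'
rownames(costMatrix) <- colnames(costMatrix) <- c("VF", "F", "M", "L")

costMatrix
##    VF F M L
## VF  0 1 1 1
## F   1 0 1 1
## M   5 5 0 1
## L  10 5 1 0
```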
> ### Create a cost function
> cost <- function(pred, obs)
+ {
+   isNA <- is.na(pred)
+   if(!all(isNA))
+   {
+     pred <- pred[!isNA]
+     obs <- obs[!isNA]
+     
+     cost <- ifelse(pred == obs, 0, 1)
+     if(any(pred == "VF" & obs == "L")) cost[pred == "VF" & obs == "L"] <- 10
+     if(any(pred == "F" & obs == "L")) cost[pred == "F" & obs == "L"] <- 5
+     if(any(pred == "F" & obs == "M")) cost[pred == "F" & obs == "M"] <- 5
+     if(any(pred == "VF" & obs == "M")) cost[pred == "VF" & obs == "M"] <- 5
+     out <- mean(cost)
+   } else out <- NA
+   out
+ }
> 
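A quick sanity check of the `cost()` function defined above, on made-up prediction vectors (hypothetical values, not model output), assuming `cost()` is in scope:

```r
## One exact match, one F-for-L error (cost 5), one F-for-M error
## (cost 5), one exact match.
lev  <- c("VF", "F", "M", "L")
pred <- factor(c("VF", "F", "F", "M"), levels = lev)
obs  <- factor(c("VF", "L", "M", "M"), levels = lev)

cost(pred, obs)  # mean(c(0, 5, 5, 0)) = 2.5
```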
> ### Make a summary function that can be used with caret's train() function
> costSummary <- function (data, lev = NULL, model = NULL)
+ {
+   if (is.character(data$obs))  data$obs <- factor(data$obs, levels = lev)
+   c(postResample(data[, "pred"], data[, "obs"]),
+     Cost = cost(data[, "pred"], data[, "obs"]))
+ }
> 
> ### Create a control object for the models
> ctrl <- trainControl(method = "repeatedcv", 
+                      repeats = 5,
+                      summaryFunction = costSummary)
> 
> ### Optional: parallel processing can be used via the 'do' packages,
> ### such as doMC, doMPI etc. We used doMC (not on Windows) to speed
> ### up the computations.
> 
> ### WARNING: Be aware of how much memory is needed to parallel
> ### process. It can very quickly overwhelm the available hardware. The
> ### estimate of the median memory usage (VSIZE = total memory size) 
> ### was 3300-4100M per core, although some calculations require as
> ### much as 3400M without parallel processing. 
> 
> library(doMC)
Loading required package: foreach
Loading required package: iterators
Loading required package: parallel
> registerDoMC(14)
> 
> ### Fit the CART model with and without costs
> 
> set.seed(857)
> rpFit <- train(x = trainData[, predictors],
+                y = trainData$Class,
+                method = "rpart",
+                metric = "Cost",
+                maximize = FALSE,
+                tuneLength = 20,
+                trControl = ctrl)
Loading required package: rpart
Loading required package: class

Attaching package: ‘e1071’

The following object is masked from ‘package:Hmisc’:

    impute

> rpFit
CART 

3467 samples
   7 predictors
   4 classes: 'VF', 'F', 'M', 'L' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times) 

Summary of sample sizes: 3120, 3120, 3120, 3121, 3120, 3120, ... 

Resampling results across tuning parameters:

  cp       Accuracy  Kappa  Cost   Accuracy SD  Kappa SD  Cost SD
  0.00236  0.774     0.631  0.51   0.0193       0.0323    0.0617 
  0.00249  0.773     0.63   0.514  0.0193       0.0319    0.0591 
  0.00294  0.768     0.621  0.537  0.0176       0.0305    0.0514 
  0.00324  0.766     0.617  0.542  0.0169       0.0298    0.0521 
  0.00353  0.764     0.611  0.55   0.017        0.03      0.0491 
  0.00383  0.762     0.607  0.56   0.0182       0.0321    0.0538 
  0.00471  0.76      0.603  0.569  0.0193       0.0345    0.0607 
  0.0053   0.758     0.597  0.58   0.0183       0.0326    0.0567 
  0.00589  0.756     0.594  0.585  0.0201       0.0355    0.0591 
  0.00648  0.751     0.586  0.604  0.0205       0.036     0.059  
  0.00824  0.735     0.558  0.647  0.0184       0.0327    0.0491 
  0.00942  0.727     0.544  0.663  0.0184       0.0328    0.0476 
  0.00982  0.723     0.539  0.667  0.0181       0.0325    0.047  
  0.01     0.719     0.532  0.67   0.0175       0.0317    0.0454 
  0.0159   0.703     0.505  0.697  0.0192       0.0327    0.0518 
  0.0171   0.698     0.495  0.717  0.0179       0.032     0.0586 
  0.0183   0.693     0.482  0.755  0.0208       0.0409    0.0797 
  0.0205   0.67      0.42   0.871  0.0227       0.0493    0.0626 
  0.0383   0.652     0.376  0.969  0.0177       0.0346    0.0517 
  0.274    0.568     0.159  0.992  0.0609       0.168     0.0323 

Cost was used to select the optimal model using  the smallest value.
The final value used for the model was cp = 0.00236. 
> 
> set.seed(857)
> rpFitCost <- train(x = trainData[, predictors],
+                    y = trainData$Class,
+                    method = "rpart",
+                    metric = "Cost",
+                    maximize = FALSE,
+                    tuneLength = 20,
+                    parms =list(loss = costMatrix),
+                    trControl = ctrl)
> rpFitCost
CART 

3467 samples
   7 predictors
   4 classes: 'VF', 'F', 'M', 'L' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times) 

Summary of sample sizes: 3120, 3120, 3120, 3121, 3120, 3120, ... 

Resampling results across tuning parameters:

  cp       Accuracy  Kappa  Cost   Accuracy SD  Kappa SD  Cost SD
  0.00236  0.72      0.565  0.343  0.0161       0.0248    0.0325 
  0.00249  0.718     0.562  0.344  0.0162       0.0248    0.0336 
  0.00294  0.717     0.56   0.344  0.0186       0.0277    0.0349 
  0.00324  0.717     0.56   0.345  0.0182       0.0272    0.0344 
  0.00353  0.713     0.555  0.35   0.0197       0.0293    0.0362 
  0.00383  0.707     0.545  0.358  0.0201       0.0297    0.038  
  0.00471  0.699     0.533  0.366  0.0205       0.0297    0.0386 
  0.0053   0.685     0.513  0.381  0.0196       0.0281    0.0376 
  0.00589  0.675     0.501  0.392  0.0207       0.0288    0.0378 
  0.00648  0.656     0.479  0.403  0.0372       0.0482    0.0461 
  0.00824  0.63      0.449  0.428  0.0451       0.0555    0.0476 
  0.00942  0.623     0.44   0.436  0.0574       0.0687    0.0478 
  0.00982  0.62      0.436  0.443  0.0581       0.0697    0.0457 
  0.01     0.617     0.433  0.445  0.0583       0.0699    0.0436 
  0.0159   0.53      0.324  0.507  0.0257       0.0303    0.0312 
  0.0171   0.52      0.306  0.526  0.0201       0.0223    0.0276 
  0.0183   0.521     0.305  0.527  0.0194       0.0219    0.0277 
  0.0205   0.515     0.295  0.532  0.0187       0.0231    0.0299 
  0.0383   0.503     0.275  0.546  0.0161       0.0179    0.0269 
  0.274    0.119     0      0.881  0.00104      0         0.00104

Cost was used to select the optimal model using  the smallest value.
The final value used for the model was cp = 0.00236. 
> 
> set.seed(857)
> ldaFit <- train(x = expandedTrain,
+                 y = trainData$Class,
+                 method = "lda",
+                 metric = "Cost",
+                 maximize = FALSE,
+                 trControl = ctrl)
Loading required package: MASS
> ldaFit
Linear Discriminant Analysis 

3467 samples
 112 predictors
   4 classes: 'VF', 'F', 'M', 'L' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times) 

Summary of sample sizes: 3120, 3120, 3120, 3121, 3120, 3120, ... 

Resampling results

  Accuracy  Kappa  Cost   Accuracy SD  Kappa SD  Cost SD
  0.756     0.602  0.523  0.0232       0.0389    0.0495 

 
> 
> sldaGrid <- expand.grid(NumVars = seq(2, 112, by = 5),
+                         lambda = c(0, 0.01, .1, 1, 10))
> set.seed(857)
> sldaFit <- train(x = expandedTrain,
+                  y = trainData$Class,
+                  method = "sparseLDA",
+                  tuneGrid = sldaGrid,
+                  preProc = c("center", "scale"),
+                  metric = "Cost",
+                  maximize = FALSE,
+                  trControl = ctrl)
Loading required package: sparseLDA
Loading required package: lars
Loaded lars 1.2

Loading required package: elasticnet
Loading required package: mda
> sldaFit
Sparse Linear Discriminant Analysis 

3467 samples
 112 predictors
   4 classes: 'VF', 'F', 'M', 'L' 

Pre-processing: centered, scaled 
Resampling: Cross-Validated (10 fold, repeated 5 times) 

Summary of sample sizes: 3120, 3120, 3120, 3121, 3120, 3120, ... 

Resampling results across tuning parameters:

  NumVars  lambda  Accuracy  Kappa  Cost   Accuracy SD  Kappa SD  Cost SD
  2        0       0.662     0.416  0.692  0.018        0.0326    0.0585 
  2        0.01    0.663     0.416  0.692  0.0179       0.0331    0.058  
  2        0.1     0.663     0.417  0.691  0.0169       0.0311    0.0573 
  2        1       0.662     0.416  0.693  0.0181       0.0327    0.0602 
  2        10      0.664     0.417  0.691  0.0164       0.0306    0.0547 
  7        0       0.681     0.457  0.707  0.0187       0.0333    0.0512 
  7        0.01    0.681     0.457  0.707  0.0187       0.0333    0.0512 
  7        0.1     0.681     0.457  0.707  0.0187       0.0333    0.0512 
  7        1       0.681     0.457  0.707  0.0188       0.0334    0.0512 
  7        10      0.681     0.457  0.707  0.0193       0.0341    0.0503 
  12       0       0.688     0.47   0.687  0.0181       0.0324    0.0526 
  12       0.01    0.688     0.47   0.687  0.0181       0.0324    0.0526 
  12       0.1     0.688     0.471  0.686  0.0182       0.0325    0.0524 
  12       1       0.688     0.47   0.687  0.018        0.0321    0.0522 
  12       10      0.687     0.469  0.689  0.0183       0.0326    0.0516 
  17       0       0.694     0.482  0.661  0.0178       0.0316    0.0516 
  17       0.01    0.694     0.483  0.661  0.0178       0.0317    0.0517 
  17       0.1     0.694     0.483  0.661  0.0181       0.032     0.0519 
  17       1       0.694     0.483  0.66   0.0176       0.0313    0.0512 
  17       10      0.693     0.482  0.662  0.0175       0.0312    0.0491 
  22       0       0.699     0.493  0.651  0.0187       0.0323    0.0487 
  22       0.01    0.699     0.493  0.651  0.0187       0.0323    0.0488 
  22       0.1     0.699     0.493  0.651  0.0187       0.0323    0.0487 
  22       1       0.699     0.493  0.651  0.0187       0.0323    0.0491 
  22       10      0.698     0.491  0.652  0.0185       0.032     0.0501 
  27       0       0.704     0.502  0.638  0.0195       0.0342    0.0578 
  27       0.01    0.704     0.503  0.637  0.0194       0.034     0.0574 
  27       0.1     0.704     0.503  0.638  0.0194       0.034     0.0578 
  27       1       0.704     0.503  0.638  0.0197       0.0345    0.0584 
  27       10      0.703     0.501  0.636  0.0199       0.0347    0.0592 
  32       0       0.712     0.518  0.626  0.0191       0.0336    0.0572 
  32       0.01    0.712     0.518  0.625  0.0191       0.0336    0.0572 
  32       0.1     0.712     0.518  0.625  0.0191       0.0336    0.0571 
  32       1       0.712     0.518  0.626  0.0191       0.0335    0.057  
  32       10      0.71      0.515  0.627  0.0193       0.0337    0.0566 
  37       0       0.721     0.536  0.611  0.0187       0.0322    0.0538 
  37       0.01    0.721     0.536  0.611  0.0187       0.0322    0.0538 
  37       0.1     0.721     0.536  0.611  0.0189       0.0324    0.0541 
  37       1       0.721     0.536  0.611  0.0187       0.0321    0.0532 
  37       10      0.717     0.529  0.615  0.0197       0.0339    0.0574 
  42       0       0.725     0.544  0.596  0.0186       0.0314    0.0508 
  42       0.01    0.725     0.544  0.596  0.0186       0.0315    0.0507 
  42       0.1     0.725     0.544  0.596  0.0185       0.0313    0.0506 
  42       1       0.725     0.544  0.595  0.0183       0.0311    0.0519 
  42       10      0.723     0.541  0.598  0.0203       0.0344    0.0522 
  47       0       0.727     0.548  0.578  0.0196       0.0325    0.0478 
  47       0.01    0.727     0.548  0.579  0.0193       0.0322    0.0486 
  47       0.1     0.727     0.548  0.579  0.0195       0.0325    0.0487 
  47       1       0.727     0.548  0.579  0.0194       0.0324    0.0491 
  47       10      0.725     0.546  0.584  0.0203       0.0336    0.0515 
  52       0       0.727     0.549  0.577  0.0206       0.0344    0.0476 
  52       0.01    0.727     0.549  0.577  0.0206       0.0344    0.0476 
  52       0.1     0.727     0.549  0.577  0.0205       0.0342    0.0475 
  52       1       0.727     0.548  0.577  0.021        0.0351    0.0483 
  52       10      0.725     0.546  0.579  0.0205       0.034     0.0495 
  57       0       0.73      0.553  0.573  0.0208       0.0348    0.0463 
  57       0.01    0.729     0.553  0.573  0.021        0.0351    0.0463 
  57       0.1     0.729     0.553  0.573  0.0209       0.035     0.0463 
  57       1       0.729     0.553  0.573  0.021        0.035     0.0455 
  57       10      0.728     0.551  0.574  0.021        0.0348    0.0474 
  62       0       0.736     0.565  0.56   0.0215       0.0359    0.0475 
  62       0.01    0.736     0.565  0.56   0.0215       0.0359    0.0475 
  62       0.1     0.736     0.565  0.56   0.0214       0.0357    0.0475 
  62       1       0.736     0.565  0.56   0.0211       0.0352    0.0475 
  62       10      0.733     0.56   0.563  0.021        0.0351    0.0485 
  67       0       0.742     0.576  0.549  0.0208       0.0344    0.0431 
  67       0.01    0.743     0.576  0.549  0.0208       0.0346    0.0432 
  67       0.1     0.743     0.576  0.549  0.0208       0.0345    0.0432 
  67       1       0.743     0.577  0.547  0.0212       0.0351    0.0449 
  67       10      0.739     0.57   0.553  0.0205       0.034     0.0452 
  72       0       0.747     0.585  0.539  0.0207       0.0346    0.0456 
  72       0.01    0.747     0.585  0.539  0.0207       0.0346    0.0456 
  72       0.1     0.747     0.585  0.539  0.0206       0.0344    0.0454 
  72       1       0.747     0.584  0.54   0.0205       0.0343    0.0447 
  72       10      0.743     0.578  0.546  0.0204       0.034     0.0432 
  77       0       0.751     0.591  0.534  0.0207       0.0347    0.042  
  77       0.01    0.751     0.591  0.534  0.0207       0.0347    0.042  
  77       0.1     0.751     0.591  0.534  0.0208       0.0348    0.0421 
  77       1       0.75      0.589  0.535  0.0213       0.0358    0.0429 
  77       10      0.747     0.584  0.54   0.0207       0.0345    0.0424 
  82       0       0.753     0.595  0.529  0.0196       0.0326    0.0409 
  82       0.01    0.753     0.595  0.529  0.0196       0.0326    0.041  
  82       0.1     0.753     0.595  0.529  0.0196       0.0326    0.0404 
  82       1       0.753     0.594  0.53   0.0199       0.0331    0.0399 
  82       10      0.748     0.586  0.537  0.0215       0.0359    0.0418 
  87       0       0.755     0.598  0.526  0.0202       0.0336    0.0428 
  87       0.01    0.755     0.598  0.526  0.0202       0.0336    0.0428 
  87       0.1     0.755     0.598  0.525  0.0203       0.0339    0.043  
  87       1       0.755     0.598  0.526  0.0202       0.0336    0.0412 
  87       10      0.75      0.59   0.532  0.0207       0.0347    0.0404 
  92       0       0.754     0.598  0.526  0.0214       0.0355    0.0451 
  92       0.01    0.754     0.598  0.527  0.0215       0.0357    0.045  
  92       0.1     0.755     0.598  0.526  0.0216       0.036     0.0452 
  92       1       0.754     0.598  0.526  0.0207       0.0345    0.0452 
  92       10      0.752     0.593  0.531  0.0213       0.0357    0.044  
  97       0       0.755     0.599  0.526  0.0217       0.0361    0.0452 
  97       0.01    0.755     0.599  0.526  0.0218       0.0363    0.0455 
  97       0.1     0.755     0.599  0.526  0.0218       0.0363    0.0455 
  97       1       0.755     0.599  0.525  0.0219       0.0363    0.0457 
  97       10      0.752     0.594  0.53   0.0217       0.0363    0.0444 
  102      0       0.754     0.598  0.527  0.0226       0.0377    0.0469 
  102      0.01    0.754     0.598  0.527  0.0224       0.0374    0.0467 
  102      0.1     0.754     0.598  0.527  0.0223       0.0373    0.0472 
  102      1       0.755     0.599  0.527  0.0224       0.0373    0.0475 
  102      10      0.753     0.595  0.53   0.0222       0.0371    0.0458 
  107      0       0.755     0.6    0.526  0.0232       0.0387    0.0497 
  107      0.01    0.755     0.6    0.526  0.0233       0.0389    0.0497 
  107      0.1     0.755     0.6    0.527  0.023        0.0383    0.0493 
  107      1       0.755     0.6    0.527  0.0225       0.0376    0.0479 
  107      10      0.753     0.597  0.53   0.0227       0.0378    0.0472 
  112      0       0.756     0.602  0.523  0.0232       0.0389    0.0495 
  112      0.01    0.756     0.602  0.523  0.0232       0.0388    0.0493 
  112      0.1     0.756     0.602  0.523  0.0232       0.0387    0.0501 
  112      1       0.756     0.601  0.524  0.0234       0.0391    0.0503 
  112      10      0.754     0.597  0.53   0.023        0.0384    0.0494 

Cost was used to select the optimal model using  the smallest value.
The final values used for the model were NumVars = 112 and lambda = 0. 
> 
> set.seed(857)
> nnetGrid <- expand.grid(decay = c(0, 0.001, 0.01, .1, .5),
+                         size = (1:10)*2 - 1)
> nnetFit <- train(modForm, 
+                  data = trainData,
+                  method = "nnet",
+                  metric = "Cost",
+                  maximize = FALSE,
+                  tuneGrid = nnetGrid,
+                  trace = FALSE,
+                  MaxNWts = 2000,
+                  maxit = 1000,
+                  preProc = c("center", "scale"),
+                  trControl = ctrl)
Loading required package: nnet
> nnetFit
Neural Network 

3467 samples
   7 predictors
   4 classes: 'VF', 'F', 'M', 'L' 

Pre-processing: centered, scaled 
Resampling: Cross-Validated (10 fold, repeated 5 times) 

Summary of sample sizes: 3120, 3120, 3120, 3121, 3120, 3120, ... 

Resampling results across tuning parameters:

  decay  size  Accuracy  Kappa  Cost   Accuracy SD  Kappa SD  Cost SD
  0      1     0.683     0.463  0.86   0.0295       0.0512    0.164  
  0      3     0.743     0.577  0.607  0.027        0.045     0.0789 
  0      5     0.757     0.605  0.524  0.0215       0.0354    0.0697 
  0      7     0.766     0.62   0.499  0.02         0.0324    0.0622 
  0      9     0.769     0.627  0.466  0.0216       0.0354    0.0547 
  0      11    0.774     0.635  0.452  0.0217       0.0351    0.0498 
  0      13    0.774     0.636  0.454  0.0202       0.0327    0.0561 
  0      15    0.768     0.626  0.455  0.0216       0.0345    0.0487 
  0      17    0.773     0.637  0.436  0.0209       0.0326    0.0459 
  0      19    0.772     0.633  0.437  0.019        0.0298    0.0391 
  0.001  1     0.694     0.486  0.769  0.0234       0.0403    0.104  
  0.001  3     0.749     0.588  0.591  0.0241       0.0394    0.066  
  0.001  5     0.766     0.619  0.513  0.02         0.0332    0.0617 
  0.001  7     0.778     0.64   0.485  0.0228       0.0377    0.067  
  0.001  9     0.782     0.647  0.452  0.0217       0.0357    0.0552 
  0.001  11    0.779     0.643  0.445  0.0211       0.034     0.0493 
  0.001  13    0.779     0.644  0.434  0.0216       0.0359    0.0592 
  0.001  15    0.779     0.644  0.432  0.0197       0.0313    0.0499 
  0.001  17    0.78      0.648  0.419  0.0212       0.0345    0.0457 
  0.001  19    0.777     0.643  0.417  0.0263       0.0416    0.061  
  0.01   1     0.694     0.488  0.74   0.022        0.0376    0.0522 
  0.01   3     0.756     0.601  0.585  0.0203       0.0336    0.0629 
  0.01   5     0.769     0.622  0.528  0.0238       0.0391    0.0735 
  0.01   7     0.778     0.64   0.475  0.0179       0.03      0.0513 
  0.01   9     0.782     0.648  0.448  0.021        0.0335    0.0482 
  0.01   11    0.785     0.653  0.437  0.0226       0.0367    0.0512 
  0.01   13    0.784     0.652  0.438  0.0204       0.0329    0.0501 
  0.01   15    0.784     0.652  0.428  0.0197       0.0318    0.0465 
  0.01   17    0.782     0.65   0.419  0.0184       0.0292    0.0441 
  0.01   19    0.787     0.658  0.412  0.0201       0.0318    0.0477 
  0.1    1     0.693     0.485  0.765  0.0202       0.0342    0.048  
  0.1    3     0.759     0.604  0.588  0.021        0.0351    0.0566 
  0.1    5     0.778     0.637  0.502  0.0233       0.0382    0.0622 
  0.1    7     0.784     0.649  0.474  0.0229       0.0375    0.06   
  0.1    9     0.794     0.665  0.434  0.0175       0.0283    0.0435 
  0.1    11    0.791     0.662  0.436  0.0228       0.0369    0.0553 
  0.1    13    0.793     0.665  0.425  0.0196       0.0322    0.0519 
  0.1    15    0.794     0.667  0.421  0.0228       0.0369    0.0552 
  0.1    17    0.796     0.671  0.407  0.0226       0.0362    0.0472 
  0.1    19    0.799     0.676  0.398  0.0214       0.034     0.0437 
  0.5    1     0.707     0.5    0.848  0.0199       0.0351    0.0551 
  0.5    3     0.756     0.598  0.606  0.0182       0.0304    0.0572 
  0.5    5     0.776     0.634  0.524  0.0196       0.0327    0.0518 
  0.5    7     0.785     0.649  0.499  0.0185       0.0301    0.0514 
  0.5    9     0.788     0.655  0.471  0.0177       0.0294    0.053  
  0.5    11    0.793     0.664  0.449  0.0195       0.0324    0.047  
  0.5    13    0.793     0.663  0.448  0.022        0.0357    0.0509 
  0.5    15    0.796     0.668  0.429  0.0201       0.0325    0.0434 
  0.5    17    0.795     0.668  0.435  0.0227       0.0375    0.0527 
  0.5    19    0.801     0.677  0.422  0.02         0.0326    0.0492 

Cost was used to select the optimal model using  the smallest value.
The final values used for the model were size = 19 and decay = 0.1. 
> 
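Every `train()` call in this script passes `metric = "Cost"` with `maximize = FALSE`, which only works because the `ctrl` object (a `trainControl` defined earlier in the script) supplies a summary function that adds a `Cost` column to each resample's results. A minimal sketch of such a function — the name `costSummary` is hypothetical, and the penalty values mirror the loss matrix built in `rpCost()` later in this transcript:

```r
## Hypothetical sketch of a caret summaryFunction producing the "Cost"
## metric; penalties mirror the loss matrix built in rpCost() below.
costSummary <- function(data, lev = NULL, model = NULL) {
  ## Penalty matrix: rows = observed class, columns = predicted class.
  lvl <- c("VF", "F", "M", "L")
  loss <- ifelse(diag(4) == 1, 0, 1)       # any plain misclassification: 1
  dimnames(loss) <- list(obs = lvl, pred = lvl)
  loss["L", "VF"] <- 10                    # very late job predicted very fast
  loss["M", "VF"] <- 5
  loss["L", "F"]  <- 5
  loss["M", "F"]  <- 5
  ## Average penalty over this resample's hold-out predictions.
  c(Cost = mean(loss[cbind(as.character(data$obs),
                           as.character(data$pred))]))
}
```

The tables above also report Accuracy and Kappa, so the actual function presumably combines a computation like this with caret's `defaultSummary()` inside `trainControl(summaryFunction = ...)`.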
> set.seed(857)
> plsFit <- train(x = expandedTrain,
+                 y = trainData$Class,
+                 method = "pls",
+                 metric = "Cost",
+                 maximize = FALSE,
+                 tuneLength = 100,
+                 preProc = c("center", "scale"),
+                 trControl = ctrl)
Loading required package: pls

Attaching package: ‘pls’

The following object is masked from ‘package:caret’:

    R2

The following object is masked from ‘package:stats’:

    loadings

> plsFit
Partial Least Squares 

3467 samples
 112 predictors
   4 classes: 'VF', 'F', 'M', 'L' 

Pre-processing: centered, scaled 
Resampling: Cross-Validated (10 fold, repeated 5 times) 

Summary of sample sizes: 3120, 3120, 3120, 3121, 3120, 3120, ... 

Resampling results across tuning parameters:

  ncomp  Accuracy  Kappa  Cost   Accuracy SD  Kappa SD  Cost SD
  1      0.645     0.352  0.998  0.0172       0.0342    0.0282 
  2      0.638     0.342  1.03   0.016        0.031     0.0264 
  3      0.646     0.357  1.02   0.0158       0.0311    0.0244 
  4      0.649     0.369  0.974  0.0162       0.0316    0.0408 
  5      0.662     0.4    0.921  0.0169       0.0319    0.0365 
  6      0.676     0.43   0.878  0.0195       0.0359    0.0485 
  7      0.677     0.434  0.853  0.0197       0.0363    0.0499 
  8      0.682     0.445  0.828  0.0203       0.0376    0.0532 
  9      0.689     0.457  0.796  0.0194       0.0358    0.0483 
  10     0.691     0.463  0.788  0.0194       0.0361    0.0515 
  11     0.692     0.467  0.776  0.0202       0.037     0.046  
  12     0.698     0.479  0.768  0.0196       0.0356    0.0496 
  13     0.7       0.484  0.761  0.0196       0.0352    0.0487 
  14     0.701     0.485  0.768  0.0196       0.0347    0.0493 
  15     0.701     0.486  0.766  0.0201       0.0362    0.051  
  16     0.704     0.492  0.761  0.0208       0.037     0.0504 
  17     0.707     0.497  0.761  0.0209       0.0376    0.0496 
  18     0.706     0.496  0.759  0.0194       0.0347    0.0527 
  19     0.707     0.498  0.756  0.0212       0.0376    0.0543 
  20     0.71      0.503  0.75   0.0186       0.0332    0.0486 
  21     0.716     0.514  0.74   0.0196       0.0347    0.052  
  22     0.719     0.519  0.734  0.0193       0.0344    0.0512 
  23     0.729     0.537  0.725  0.0184       0.0324    0.0485 
  24     0.726     0.533  0.731  0.0202       0.0355    0.0512 
  25     0.727     0.536  0.712  0.0198       0.0349    0.0489 
  26     0.727     0.536  0.711  0.0218       0.0381    0.0495 
  27     0.728     0.539  0.708  0.0205       0.0363    0.0495 
  28     0.728     0.539  0.703  0.0205       0.0361    0.0525 
  29     0.728     0.54   0.704  0.021        0.037     0.0514 
  30     0.73      0.543  0.698  0.0215       0.0378    0.0515 
  31     0.731     0.546  0.695  0.0213       0.0373    0.0499 
  32     0.732     0.547  0.693  0.0225       0.0393    0.0497 
  33     0.734     0.551  0.688  0.0216       0.0378    0.0487 
  34     0.736     0.553  0.684  0.0216       0.0377    0.0497 
  35     0.737     0.556  0.683  0.0198       0.0348    0.0464 
  36     0.739     0.559  0.677  0.0202       0.0353    0.0469 
  37     0.74      0.56   0.675  0.0217       0.0378    0.0503 
  38     0.74      0.561  0.673  0.0199       0.0345    0.049  
  39     0.742     0.564  0.669  0.0203       0.0354    0.0509 
  40     0.741     0.563  0.67   0.019        0.0333    0.0491 
  41     0.742     0.564  0.667  0.0196       0.034     0.0492 
  42     0.742     0.564  0.666  0.0197       0.0342    0.0509 
  43     0.742     0.565  0.662  0.0203       0.0352    0.0507 
  44     0.743     0.567  0.661  0.0202       0.0349    0.0499 
  45     0.743     0.567  0.658  0.0203       0.0354    0.0501 
  46     0.743     0.568  0.657  0.0205       0.0356    0.0503 
  47     0.743     0.568  0.655  0.0203       0.0352    0.0494 
  48     0.745     0.571  0.65   0.02         0.0347    0.0497 
  49     0.744     0.57   0.652  0.0201       0.0349    0.0507 
  50     0.745     0.571  0.65   0.0199       0.0344    0.0491 
  51     0.744     0.569  0.652  0.0197       0.0339    0.0495 
  52     0.744     0.57   0.65   0.0197       0.0341    0.0494 
  53     0.745     0.571  0.649  0.0207       0.0357    0.0512 
  54     0.745     0.572  0.648  0.0204       0.0351    0.0499 
  55     0.745     0.572  0.648  0.0203       0.0349    0.0507 
  56     0.745     0.572  0.647  0.0196       0.0337    0.051  
  57     0.746     0.573  0.644  0.0194       0.0332    0.0481 
  58     0.745     0.572  0.646  0.0191       0.0328    0.0487 
  59     0.745     0.573  0.645  0.0197       0.034     0.05   
  60     0.746     0.573  0.644  0.0198       0.0342    0.0504 
  61     0.746     0.574  0.642  0.0194       0.0335    0.0495 
  62     0.746     0.574  0.641  0.0201       0.0347    0.0499 
  63     0.746     0.574  0.641  0.0206       0.0355    0.0505 
  64     0.747     0.575  0.641  0.0201       0.0347    0.05   
  65     0.747     0.575  0.64   0.0206       0.0354    0.0491 
  66     0.747     0.576  0.638  0.02         0.0345    0.0492 
  67     0.747     0.576  0.639  0.0203       0.0349    0.0488 
  68     0.747     0.576  0.639  0.0202       0.0347    0.0487 
  69     0.747     0.575  0.64   0.0204       0.0351    0.0502 
  70     0.747     0.576  0.639  0.0198       0.034     0.0491 
  71     0.747     0.576  0.638  0.0201       0.0345    0.0486 
  72     0.748     0.577  0.636  0.0201       0.0346    0.05   
  73     0.748     0.577  0.637  0.0201       0.0345    0.0496 
  74     0.748     0.577  0.637  0.0205       0.0354    0.0516 
  75     0.747     0.576  0.638  0.0207       0.0357    0.0523 
  76     0.747     0.576  0.639  0.0205       0.0353    0.0511 
  77     0.747     0.576  0.639  0.0201       0.0346    0.0501 
  78     0.747     0.576  0.639  0.02         0.0345    0.0506 
  79     0.747     0.575  0.639  0.0198       0.0341    0.0491 
  80     0.747     0.575  0.64   0.0197       0.034     0.0495 
  81     0.747     0.575  0.641  0.02         0.0344    0.0494 
  82     0.747     0.575  0.641  0.0203       0.035     0.0498 
  83     0.747     0.575  0.641  0.0201       0.0347    0.0494 
  84     0.747     0.575  0.641  0.0203       0.0349    0.0496 
  85     0.747     0.575  0.641  0.0203       0.035     0.0497 
  86     0.747     0.575  0.641  0.0198       0.0341    0.0494 
  87     0.747     0.575  0.641  0.0201       0.0346    0.0499 
  88     0.747     0.575  0.641  0.0202       0.0348    0.0499 
  89     0.747     0.575  0.641  0.0203       0.0349    0.0498 
  90     0.747     0.575  0.641  0.0203       0.035     0.0499 
  91     0.747     0.575  0.64   0.0204       0.0351    0.0501 
  92     0.747     0.575  0.641  0.0204       0.035     0.0498 
  93     0.747     0.575  0.641  0.0205       0.0353    0.0499 
  94     0.747     0.575  0.641  0.0206       0.0353    0.0499 
  95     0.747     0.575  0.641  0.0206       0.0354    0.0499 
  96     0.747     0.575  0.641  0.0205       0.0352    0.0498 
  97     0.747     0.575  0.641  0.0205       0.0352    0.0498 
  98     0.747     0.575  0.641  0.0205       0.0352    0.0498 
  99     0.747     0.575  0.641  0.0205       0.0352    0.0498 
  100    0.747     0.575  0.641  0.0206       0.0353    0.0499 

Cost was used to select the optimal model using  the smallest value.
The final value used for the model was ncomp = 72. 
> 
> set.seed(857)
> fdaFit <- train(modForm, data = trainData,
+                 method = "fda",
+                 metric = "Cost",
+                 maximize = FALSE,
+                 tuneLength = 25,
+                 trControl = ctrl)
Loading required package: earth
Loading required package: plotmo
Loading required package: plotrix
> fdaFit
Flexible Discriminant Analysis 

3467 samples
   7 predictors
   4 classes: 'VF', 'F', 'M', 'L' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times) 

Summary of sample sizes: 3120, 3120, 3120, 3121, 3120, 3120, ... 

Resampling results across tuning parameters:

  nprune  Accuracy  Kappa   Cost   Accuracy SD  Kappa SD  Cost SD
  2       0.524     0.0711  0.929  0.00646      0.021     0.0455 
  3       0.541     0.142   0.843  0.00898      0.0221    0.0368 
  4       0.61      0.298   0.79   0.0143       0.03      0.0412 
  5       0.659     0.405   0.753  0.0156       0.03      0.042  
  6       0.678     0.451   0.75   0.018        0.0324    0.0468 
  7       0.684     0.466   0.699  0.0174       0.0305    0.0513 
  8       0.693     0.487   0.64   0.0206       0.0359    0.0522 
  9       0.695     0.491   0.634  0.0214       0.0369    0.0549 
  10      0.698     0.496   0.631  0.021        0.0363    0.0551 
  11      0.71      0.518   0.62   0.0224       0.0382    0.0575 
  12      0.713     0.524   0.617  0.0204       0.0351    0.054  
  13      0.715     0.529   0.612  0.0229       0.0388    0.0584 
  14      0.724     0.544   0.602  0.0222       0.0375    0.0593 
  15      0.726     0.547   0.602  0.019        0.0328    0.0567 
  16      0.727     0.548   0.602  0.0202       0.0344    0.0559 
  17      0.725     0.545   0.608  0.019        0.033     0.0571 
  18      0.726     0.547   0.606  0.0205       0.0352    0.0588 
  19      0.727     0.548   0.607  0.0206       0.0348    0.0598 
  20      0.727     0.549   0.606  0.0208       0.0353    0.0596 
  21      0.729     0.552   0.602  0.0213       0.0358    0.0572 
  22      0.731     0.555   0.6    0.0213       0.0361    0.0583 
  23      0.732     0.557   0.598  0.0202       0.0343    0.0562 

Tuning parameter 'degree' was held constant at a value of 1
Cost was used to select the optimal model using  the smallest value.
The final values used for the model were degree = 1 and nprune = 23. 
> 
> set.seed(857)
> rfFit <- train(x = trainData[, predictors],
+                y = trainData$Class,
+                method = "rf",
+                metric = "Cost",
+                maximize = FALSE,
+                tuneLength = 10,
+                ntree = 2000,
+                importance = TRUE,
+                trControl = ctrl)
Loading required package: randomForest
randomForest 4.6-7
Type rfNews() to see new features/changes/bug fixes.

Attaching package: ‘randomForest’

The following object is masked from ‘package:Hmisc’:

    combine

note: only 6 unique complexity parameters in default grid. Truncating the grid to 6 .

> rfFit
Random Forest 

3467 samples
   7 predictors
   4 classes: 'VF', 'F', 'M', 'L' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times) 

Summary of sample sizes: 3120, 3120, 3120, 3121, 3120, 3120, ... 

Resampling results across tuning parameters:

  mtry  Accuracy  Kappa  Cost   Accuracy SD  Kappa SD  Cost SD
  2     0.842     0.743  0.336  0.0168       0.0275    0.042  
  3     0.845     0.748  0.328  0.0176       0.0289    0.0419 
  4     0.845     0.748  0.326  0.0173       0.0282    0.0434 
  5     0.843     0.746  0.328  0.0166       0.0272    0.0443 
  6     0.843     0.745  0.328  0.0172       0.0282    0.0462 
  7     0.842     0.744  0.328  0.0171       0.0279    0.0437 

Cost was used to select the optimal model using  the smallest value.
The final value used for the model was mtry = 4. 
> 
> set.seed(857)
> rfFitCost <- train(x = trainData[, predictors],
+                    y = trainData$Class,
+                    method = "rf",
+                    metric = "Cost",
+                    maximize = FALSE,
+                    tuneLength = 10,
+                    ntree = 2000,
+                    classwt = c(VF = 1, F = 1, M = 5, L = 10),
+                    importance = TRUE,
+                    trControl = ctrl)
note: only 6 unique complexity parameters in default grid. Truncating the grid to 6 .

> rfFitCost
Random Forest 

3467 samples
   7 predictors
   4 classes: 'VF', 'F', 'M', 'L' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times) 

Summary of sample sizes: 3120, 3120, 3120, 3121, 3120, 3120, ... 

Resampling results across tuning parameters:

  mtry  Accuracy  Kappa  Cost   Accuracy SD  Kappa SD  Cost SD
  2     0.84      0.739  0.34   0.0171       0.0281    0.0452 
  3     0.843     0.745  0.345  0.0159       0.0259    0.0413 
  4     0.844     0.746  0.345  0.016        0.0263    0.0439 
  5     0.844     0.747  0.341  0.0182       0.0298    0.0459 
  6     0.846     0.75   0.337  0.0168       0.0275    0.0432 
  7     0.845     0.748  0.337  0.0169       0.0274    0.0416 

Cost was used to select the optimal model using  the smallest value.
The final value used for the model was mtry = 7. 
> 
> c5Grid <- expand.grid(trials = c(1, (1:10)*10),
+                       model = "tree",
+                       winnow = c(TRUE, FALSE))
> set.seed(857)
> c50Fit <- train(x = trainData[, predictors],
+                 y = trainData$Class,
+                 method = "C5.0",
+                 metric = "Cost",
+                 maximize = FALSE,
+                 tuneGrid = c5Grid,
+                 trControl = ctrl)
Loading required package: C50
Loading required package: plyr

Attaching package: ‘plyr’

The following object is masked from ‘package:Hmisc’:

    is.discrete, summarize

> c50Fit
C5.0 

3467 samples
   7 predictors
   4 classes: 'VF', 'F', 'M', 'L' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times) 

Summary of sample sizes: 3120, 3120, 3120, 3121, 3120, 3120, ... 

Resampling results across tuning parameters:

  winnow  trials  Accuracy  Kappa  Cost   Accuracy SD  Kappa SD  Cost SD
  FALSE   1       0.801     0.677  0.396  0.0183       0.0305    0.0476 
  FALSE   10      0.833     0.729  0.338  0.0185       0.0308    0.047  
  FALSE   20      0.836     0.735  0.326  0.0177       0.0291    0.0471 
  FALSE   30      0.839     0.739  0.324  0.0175       0.0288    0.0439 
  FALSE   40      0.839     0.739  0.324  0.0177       0.0289    0.0433 
  FALSE   50      0.839     0.739  0.322  0.0175       0.0286    0.0451 
  FALSE   60      0.839     0.74   0.322  0.0185       0.0303    0.0444 
  FALSE   70      0.84      0.741  0.32   0.0165       0.0271    0.0432 
  FALSE   80      0.84      0.741  0.319  0.0171       0.0281    0.0431 
  FALSE   90      0.841     0.743  0.318  0.0163       0.027     0.044  
  FALSE   100     0.841     0.742  0.32   0.016        0.0263    0.0432 
  TRUE    1       0.801     0.678  0.397  0.018        0.0299    0.0463 
  TRUE    10      0.832     0.727  0.34   0.0182       0.0302    0.0484 
  TRUE    20      0.834     0.732  0.327  0.0176       0.0288    0.048  
  TRUE    30      0.837     0.737  0.323  0.0168       0.0276    0.0456 
  TRUE    40      0.838     0.737  0.323  0.0167       0.0272    0.0443 
  TRUE    50      0.838     0.737  0.32   0.0164       0.0267    0.0451 
  TRUE    60      0.839     0.739  0.32   0.017        0.0276    0.0436 
  TRUE    70      0.839     0.739  0.319  0.0158       0.0258    0.0436 
  TRUE    80      0.839     0.74   0.318  0.0161       0.0264    0.0438 
  TRUE    90      0.84      0.741  0.317  0.0161       0.0265    0.0453 
  TRUE    100     0.841     0.742  0.317  0.0158       0.0259    0.0451 

Tuning parameter 'model' was held constant at a value of tree
Cost was used to select the optimal model using  the smallest value.
The final values used for the model were trials = 90, model = tree and winnow
 = TRUE. 
> 
> set.seed(857)
> c50Cost <- train(x = trainData[, predictors],
+                  y = trainData$Class,
+                  method = "C5.0",
+                  metric = "Cost",
+                  maximize = FALSE,
+                  costs = costMatrix,
+                  tuneGrid = c5Grid,
+                  trControl = ctrl)
> c50Cost
C5.0 

3467 samples
   7 predictors
   4 classes: 'VF', 'F', 'M', 'L' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times) 

Summary of sample sizes: 3120, 3120, 3120, 3121, 3120, 3120, ... 

Resampling results across tuning parameters:

  winnow  trials  Accuracy  Kappa  Cost   Accuracy SD  Kappa SD  Cost SD
  FALSE   1       0.796     0.667  0.462  0.0185       0.0312    0.0526 
  FALSE   10      0.829     0.723  0.346  0.0188       0.0311    0.047  
  FALSE   20      0.834     0.731  0.33   0.0204       0.0337    0.0506 
  FALSE   30      0.835     0.733  0.325  0.0192       0.0318    0.048  
  FALSE   40      0.835     0.733  0.322  0.018        0.0297    0.0433 
  FALSE   50      0.836     0.735  0.318  0.0192       0.0316    0.0442 
  FALSE   60      0.836     0.734  0.318  0.0186       0.0307    0.045  
  FALSE   70      0.837     0.736  0.315  0.0181       0.0299    0.0454 
  FALSE   80      0.837     0.737  0.314  0.0189       0.031     0.0461 
  FALSE   90      0.839     0.739  0.314  0.0178       0.0293    0.0462 
  FALSE   100     0.839     0.74   0.317  0.0183       0.0302    0.0483 
  TRUE    1       0.773     0.624  0.554  0.0368       0.0694    0.128  
  TRUE    10      0.793     0.658  0.461  0.0511       0.094     0.174  
  TRUE    20      0.796     0.663  0.449  0.0524       0.0963    0.179  
  TRUE    30      0.797     0.664  0.446  0.0529       0.097     0.181  
  TRUE    40      0.796     0.664  0.446  0.0527       0.0967    0.181  
  TRUE    50      0.796     0.663  0.445  0.0525       0.0964    0.181  
  TRUE    60      0.796     0.663  0.444  0.0523       0.0962    0.182  
  TRUE    70      0.796     0.664  0.443  0.0522       0.096     0.182  
  TRUE    80      0.798     0.666  0.441  0.0533       0.0977    0.184  
  TRUE    90      0.799     0.668  0.441  0.0542       0.0991    0.184  
  TRUE    100     0.799     0.668  0.442  0.0542       0.0991    0.183  

Tuning parameter 'model' was held constant at a value of tree
Cost was used to select the optimal model using  the smallest value.
The final values used for the model were trials = 90, model = tree and winnow
 = FALSE. 
> 
> set.seed(857)
> bagFit <- train(x = trainData[, predictors],
+                 y = trainData$Class,
+                 method = "treebag",
+                 metric = "Cost",
+                 maximize = FALSE,
+                 nbagg = 50,
+                 trControl = ctrl)
Loading required package: ipred
Loading required package: prodlim
KernSmooth 2.23 loaded
Copyright M. P. Wand 1997-2009
> bagFit
Bagged CART 

3467 samples
   7 predictors
   4 classes: 'VF', 'F', 'M', 'L' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times) 

Summary of sample sizes: 3120, 3120, 3120, 3121, 3120, 3120, ... 

Resampling results

  Accuracy  Kappa  Cost   Accuracy SD  Kappa SD  Cost SD
  0.836     0.735  0.333  0.0155       0.0249    0.0417 

 
> 
> ### Use the caret bag() function to bag the cost-sensitive CART model
> rpCost <- function(x, y)
+ {
+   costMatrix <- ifelse(diag(4) == 1, 0, 1)
+   costMatrix[4, 1] <- 10
+   costMatrix[3, 1] <- 5
+   costMatrix[4, 2] <- 5
+   costMatrix[3, 2] <- 5
+   library(rpart)
+   tmp <- x
+   tmp$y <- y
+   rpart(y~., data = tmp, control = rpart.control(cp = 0),
+         parms =list(loss = costMatrix))
+ }
> rpPredict <- function(object, x) predict(object, x)
> 
> rpAgg <- function (x, type = "class")
+ {
+   pooled <- x[[1]] * NA
+   n <- nrow(pooled)
+   classes <- colnames(pooled)
+   for (i in 1:ncol(pooled))
+   {
+     tmp <- lapply(x, function(y, col) y[, col], col = i)
+     tmp <- do.call("rbind", tmp)
+     pooled[, i] <- apply(tmp, 2, median)
+   }
+   pooled <- apply(pooled, 1, function(x) x/sum(x))
+   if (n != nrow(pooled)) pooled <- t(pooled)
+   out <- factor(classes[apply(pooled, 1, which.max)], levels = classes)
+   out
+ }
> 
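The `rpAgg()` aggregator above pools the class-probability matrices returned by the bagged fits: for each sample and class it takes the median across fits, renormalizes so each sample's probabilities sum to one, and votes for the class with the largest pooled probability. The same idea in miniature, with invented probability values for illustration:

```r
## Toy illustration of rpAgg()'s median pooling: three bagged fits,
## two samples, two classes (numbers invented for the example).
p1 <- rbind(c(0.8, 0.2), c(0.3, 0.7))
p2 <- rbind(c(0.6, 0.4), c(0.4, 0.6))
p3 <- rbind(c(0.7, 0.3), c(0.2, 0.8))
colnames(p1) <- colnames(p2) <- colnames(p3) <- c("VF", "F")
pooled <- apply(simplify2array(list(p1, p2, p3)), c(1, 2), median)
pooled <- pooled / rowSums(pooled)      # renormalize each sample's row
colnames(pooled)[max.col(pooled)]       # -> "VF" "F"
```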
> 
> set.seed(857)
> rpCostBag <- train(trainData[, predictors],
+                    trainData$Class,
+                    "bag",
+                    B = 50,
+                    bagControl = bagControl(fit = rpCost,
+                                            predict = rpPredict,
+                                            aggregate = rpAgg,
+                                            downSample = FALSE,
+                                            allowParallel = FALSE),
+                    trControl = ctrl)
> rpCostBag
Bagged Model 

3467 samples
   7 predictors
   4 classes: 'VF', 'F', 'M', 'L' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times) 

Summary of sample sizes: 3120, 3120, 3120, 3121, 3120, 3120, ... 

Resampling results

  Accuracy  Kappa  Cost   Accuracy SD  Kappa SD  Cost SD
  0.807     0.689  0.369  0.0163       0.0263    0.0446 

Tuning parameter 'vars' was held constant at a value of 7
 
> 
> set.seed(857)
> svmRFit <- train(modForm,
+                  data = trainData,
+                  method = "svmRadial",
+                  metric = "Cost",
+                  maximize = FALSE,
+                  preProc = c("center", "scale"),
+                  tuneLength = 15,
+                  trControl = ctrl)
Loading required package: kernlab
> svmRFit
Support Vector Machines with Radial Basis Function Kernel 

3467 samples
   7 predictors
   4 classes: 'VF', 'F', 'M', 'L' 

Pre-processing: centered, scaled 
Resampling: Cross-Validated (10 fold, repeated 5 times) 

Summary of sample sizes: 3120, 3120, 3120, 3121, 3120, 3120, ... 

Resampling results across tuning parameters:

  C     Accuracy  Kappa  Cost   Accuracy SD  Kappa SD  Cost SD
  0.25  0.704     0.486  0.853  0.0168       0.0299    0.0405 
  0.5   0.744     0.568  0.671  0.0202       0.0352    0.0548 
  1     0.77      0.618  0.562  0.0193       0.0332    0.0494 
  2     0.784     0.644  0.522  0.0207       0.0347    0.0476 
  4     0.791     0.658  0.49   0.0194       0.0322    0.044  
  8     0.797     0.668  0.456  0.0181       0.0297    0.0391 
  16    0.799     0.673  0.438  0.0184       0.0299    0.0413 
  32    0.801     0.677  0.424  0.0183       0.0296    0.0394 
  64    0.802     0.679  0.415  0.0183       0.0298    0.0446 
  128   0.802     0.68   0.404  0.0202       0.0331    0.0495 
  256   0.805     0.684  0.393  0.022        0.0363    0.0522 
  512   0.807     0.689  0.385  0.021        0.0345    0.0533 
  1020  0.808     0.69   0.38   0.0212       0.0345    0.0543 
  2050  0.804     0.684  0.387  0.0218       0.0353    0.0518 
  4100  0.802     0.679  0.391  0.0199       0.0324    0.0489 

Tuning parameter 'sigma' was held constant at a value of 0.03332721
Cost was used to select the optimal model using  the smallest value.
The final values used for the model were sigma = 0.0333 and C = 1024. 
> 
> set.seed(857)
> svmRFitCost <- train(modForm, data = trainData,
+                      method = "svmRadial",
+                      metric = "Cost",
+                      maximize = FALSE,
+                      preProc = c("center", "scale"),
+                      class.weights = c(VF = 1, F = 1, M = 5, L = 10),
+                      tuneLength = 15,
+                      trControl = ctrl)
> svmRFitCost
Support Vector Machines with Radial Basis Function Kernel 

3467 samples
   7 predictors
   4 classes: 'VF', 'F', 'M', 'L' 

Pre-processing: centered, scaled 
Resampling: Cross-Validated (10 fold, repeated 5 times) 

Summary of sample sizes: 3120, 3120, 3120, 3121, 3120, 3120, ... 

Resampling results across tuning parameters:

  C     Accuracy  Kappa  Cost   Accuracy SD  Kappa SD  Cost SD
  0.25  0.681     0.513  0.378  0.0227       0.0333    0.0402 
  0.5   0.703     0.543  0.365  0.0201       0.0303    0.0354 
  1     0.726     0.576  0.347  0.0185       0.0278    0.0321 
  2     0.744     0.602  0.337  0.0179       0.0272    0.0356 
  4     0.753     0.614  0.339  0.0161       0.0244    0.0304 
  8     0.762     0.626  0.34   0.0165       0.0258    0.0395 
  16    0.77      0.637  0.347  0.0182       0.0288    0.0411 
  32    0.777     0.647  0.346  0.0186       0.0292    0.0446 
  64    0.783     0.655  0.35   0.0209       0.0331    0.0481 
  128   0.787     0.661  0.359  0.0223       0.0356    0.0517 
  256   0.79      0.665  0.36   0.0231       0.0371    0.0515 
  512   0.791     0.666  0.37   0.0235       0.0379    0.0521 
  1020  0.794     0.669  0.376  0.0222       0.0358    0.0534 
  2050  0.795     0.671  0.378  0.0224       0.0363    0.0517 
  4100  0.793     0.667  0.389  0.0202       0.0325    0.0503 

Tuning parameter 'sigma' was held constant at a value of 0.03332721
Cost was used to select the optimal model using  the smallest value.
The final values used for the model were sigma = 0.0333 and C = 2. 
> 
> modelList <- list(C5.0 = c50Fit,
+                   "C5.0 (Costs)" = c50Cost,
+                   CART = rpFit,
+                   "CART (Costs)" = rpFitCost,
+                   "Bagging (Costs)" = rpCostBag,
+                   FDA = fdaFit,
+                   SVM = svmRFit,
+                   "SVM (Weights)" = svmRFitCost,
+                   PLS = plsFit,
+                   "Random Forests" = rfFit,
+                   LDA = ldaFit,
+                   "LDA (Sparse)" = sldaFit,
+                   "Neural Networks" = nnetFit,
+                   Bagging = bagFit)
> 
> 
> ################################################################################
> ### Section 17.2 Results
> 
> rs <- resamples(modelList)
> summary(rs)

Call:
summary.resamples(object = rs)

Models: C5.0, C5.0 (Costs), CART, CART (Costs), Bagging (Costs), FDA, SVM, SVM (Weights), PLS, Random Forests, LDA, LDA (Sparse), Neural Networks, Bagging 
Number of resamples: 50 

Accuracy 
                  Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
C5.0            0.8040  0.8278 0.8427 0.8404  0.8473 0.8736    0
C5.0 (Costs)    0.8069  0.8249 0.8357 0.8387  0.8500 0.8757    0
CART            0.7328  0.7637 0.7723 0.7738  0.7859 0.8242    0
CART (Costs)    0.6888  0.7081 0.7201 0.7199  0.7312 0.7550    0
Bagging (Costs) 0.7637  0.7949 0.8092 0.8065  0.8173 0.8329    0
FDA             0.6686  0.7199 0.7309 0.7315  0.7457 0.7723    0
SVM             0.7579  0.7961 0.8055 0.8076  0.8202 0.8555    0
SVM (Weights)   0.7069  0.7320 0.7435 0.7444  0.7543 0.7896    0
PLS             0.7061  0.7351 0.7460 0.7478  0.7608 0.7960    0
Random Forests  0.8017  0.8324 0.8444 0.8447  0.8559 0.8844    0
LDA             0.7176  0.7389 0.7511 0.7560  0.7752 0.8132    0
LDA (Sparse)    0.7176  0.7389 0.7511 0.7560  0.7752 0.8132    0
Neural Networks 0.7522  0.7844 0.7991 0.7990  0.8143 0.8621    0
Bagging         0.8069  0.8262 0.8372 0.8361  0.8473 0.8671    0

Kappa 
                  Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
C5.0            0.6792  0.7208 0.7450 0.7414  0.7545 0.7951    0
C5.0 (Costs)    0.6860  0.7158 0.7361 0.7387  0.7600 0.7979    0
CART            0.5655  0.6118 0.6297 0.6314  0.6505 0.7165    0
CART (Costs)    0.5193  0.5465 0.5669 0.5649  0.5825 0.6187    0
Bagging (Costs) 0.6170  0.6694 0.6922 0.6891  0.7067 0.7339    0
FDA             0.4497  0.5381 0.5535 0.5571  0.5819 0.6308    0
SVM             0.6087  0.6739 0.6869 0.6895  0.7095 0.7655    0
SVM (Weights)   0.5428  0.5855 0.5990 0.6017  0.6151 0.6699    0
PLS             0.5080  0.5558 0.5740 0.5768  0.6010 0.6598    0
Random Forests  0.6784  0.7282 0.7477 0.7477  0.7655 0.8107    0
LDA             0.5401  0.5712 0.5931 0.6020  0.6361 0.6968    0
LDA (Sparse)    0.5401  0.5712 0.5931 0.6020  0.6361 0.6968    0
Neural Networks 0.6028  0.6512 0.6761 0.6761  0.6980 0.7765    0
Bagging         0.6830  0.7168 0.7346 0.7346  0.7533 0.7844    0

Cost 
                  Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
C5.0            0.2254  0.2919 0.3146 0.3172  0.3357 0.4265    0
C5.0 (Costs)    0.2283  0.2795 0.3112 0.3138  0.3472 0.4195    0
CART            0.3718  0.4761 0.5144 0.5095  0.5465 0.6580    0
CART (Costs)    0.2882  0.3220 0.3425 0.3427  0.3613 0.4236    0
Bagging (Costs) 0.2803  0.3345 0.3646 0.3693  0.3954 0.5130    0
FDA             0.4813  0.5552 0.5908 0.5983  0.6433 0.7118    0
SVM             0.2717  0.3465 0.3790 0.3802  0.4022 0.5260    0
SVM (Weights)   0.2565  0.3134 0.3309 0.3367  0.3598 0.4265    0
PLS             0.5562  0.5937 0.6297 0.6364  0.6712 0.7435    0
Random Forests  0.2543  0.2997 0.3184 0.3259  0.3429 0.4265    0
LDA             0.4150  0.4913 0.5237 0.5229  0.5632 0.6254    0
LDA (Sparse)    0.4150  0.4913 0.5237 0.5229  0.5632 0.6254    0
Neural Networks 0.3055  0.3729 0.3988 0.3981  0.4261 0.5029    0
Bagging         0.2630  0.3057 0.3261 0.3326  0.3581 0.4467    0

> 
> confusionMatrix(rpFitCost, "none")
Cross-Validated (10 fold, repeated 5 times) Confusion Matrix 

(entries are un-normalized counts)
 
          Reference
Prediction    VF     F     M     L
        VF 157.5  25.6   1.9   0.2
        F   10.0  43.1   3.3   0.2
        M    9.4  37.0  34.3   5.7
        L    0.1   2.0   1.7  14.7

> confusionMatrix(rfFit, "none") 
Cross-Validated (10 fold, repeated 5 times) Confusion Matrix 

(entries are un-normalized counts)
 
          Reference
Prediction    VF     F     M     L
        VF 164.8  17.9   1.3   0.2
        F   12.0  83.8  11.6   1.9
        M    0.2   5.5  27.3   1.8
        L    0.0   0.6   1.0  16.9

> 
> plot(bwplot(rs, metric = "Cost"))
> 
> rfPred <- predict(rfFit, testData)
> rpPred <- predict(rpFitCost, testData)
> 
> confusionMatrix(rfPred, testData$Class)
Confusion Matrix and Statistics

          Reference
Prediction  VF   F   M   L
        VF 414  45   3   0
        F   28 206  27   5
        M    0  18  71   6
        L    0   0   1  40

Overall Statistics
                                          
               Accuracy : 0.8461          
                 95% CI : (0.8202, 0.8695)
    No Information Rate : 0.5116          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.7496          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: VF Class: F Class: M Class: L
Sensitivity             0.9367   0.7658  0.69608  0.78431
Specificity             0.8863   0.8992  0.96850  0.99877
Pos Pred Value          0.8961   0.7744  0.74737  0.97561
Neg Pred Value          0.9303   0.8946  0.95969  0.98663
Prevalence              0.5116   0.3113  0.11806  0.05903
Detection Rate          0.4792   0.2384  0.08218  0.04630
Detection Prevalence    0.5347   0.3079  0.10995  0.04745
Balanced Accuracy       0.9115   0.8325  0.83229  0.89154
> confusionMatrix(rpPred, testData$Class)
Confusion Matrix and Statistics

          Reference
Prediction  VF   F   M   L
        VF 383  61   5   1
        F   32 106   7   2
        M   26  99  87  15
        L    1   3   3  33

Overall Statistics
                                          
               Accuracy : 0.7049          
                 95% CI : (0.6732, 0.7351)
    No Information Rate : 0.5116          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.5437          
 Mcnemar's Test P-Value : < 2.2e-16       

Statistics by Class:

                     Class: VF Class: F Class: M Class: L
Sensitivity             0.8665   0.3941   0.8529  0.64706
Specificity             0.8412   0.9311   0.8163  0.99139
Pos Pred Value          0.8511   0.7211   0.3833  0.82500
Neg Pred Value          0.8575   0.7727   0.9765  0.97816
Prevalence              0.5116   0.3113   0.1181  0.05903
Detection Rate          0.4433   0.1227   0.1007  0.03819
Detection Prevalence    0.5208   0.1701   0.2627  0.04630
Balanced Accuracy       0.8539   0.6626   0.8346  0.81922
> 
> 
> ################################################################################
> ### Session Information
> 
> sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
 [1] parallel  grid      tools     splines   stats     graphics  grDevices
 [8] utils     datasets  methods   base     

other attached packages:
 [1] kernlab_0.9-18                  ipred_0.9-1                    
 [3] prodlim_1.3.7                   plyr_1.8                       
 [5] C50_0.1.0-15                    randomForest_4.6-7             
 [7] earth_3.2-6                     plotrix_3.4-7                  
 [9] plotmo_1.3-2                    pls_2.3-0                      
[11] nnet_7.3-6                      sparseLDA_0.1-6                
[13] mda_0.4-2                       elasticnet_1.1                 
[15] lars_1.2                        MASS_7.3-26                    
[17] e1071_1.6-1                     class_7.3-7                    
[19] rpart_4.1-1                     doMC_1.3.0                     
[21] iterators_1.0.6                 foreach_1.4.0                  
[23] caret_6.0-22                    ggplot2_0.9.3.1                
[25] lattice_0.20-15                 tabplot_1.0                    
[27] ffbase_0.8                      ff_2.2-11                      
[29] bit_1.1-10                      Hmisc_3.10-1.1                 
[31] survival_2.37-4                 AppliedPredictiveModeling_1.1-5

loaded via a namespace (and not attached):
 [1] car_2.0-17         cluster_1.14.4     codetools_0.2-8    colorspace_1.2-2  
 [5] compiler_3.0.1     CORElearn_0.9.41   dichromat_2.0-0    digest_0.6.3      
 [9] gtable_0.1.2       KernSmooth_2.23-10 labeling_0.1       munsell_0.4       
[13] proto_0.3-10       RColorBrewer_1.0-5 reshape2_1.2.2     scales_0.2.3      
[17] stringr_0.6.2     
> 
> q("no")
> proc.time()
     user    system   elapsed 
492217.97  31824.96  39801.06 
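The Kappa statistics reported above compare observed agreement against the agreement expected by chance from the confusion matrix margins. As a sanity check, here is a minimal sketch (not part of the chapter script) that recomputes Cohen's kappa by hand from the random-forest test-set confusion matrix shown earlier:

```r
## Cohen's kappa from a confusion matrix: observed agreement (po) versus
## chance agreement implied by the row/column marginal totals (pe).
kappa_from_table <- function(tab) {
  tab <- as.matrix(tab)
  n  <- sum(tab)
  po <- sum(diag(tab)) / n                       # observed agreement
  pe <- sum(rowSums(tab) * colSums(tab)) / n^2   # chance agreement
  (po - pe) / (1 - pe)
}

## Random-forest test-set confusion matrix from confusionMatrix(rfPred, ...)
## (rows = Prediction, columns = Reference; classes VF, F, M, L)
rfTab <- matrix(c(414,  28,  0,  0,
                   45, 206, 18,  0,
                    3,  27, 71,  1,
                    0,   5,  6, 40), nrow = 4)
kappa_from_table(rfTab)   # about 0.7496, matching the reported Kappa
```

The same arithmetic underlies the resampled Kappa summaries in the table above; caret's `confusionMatrix()` just applies it per resample or per hold-out set.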
In [83]:
%%R -w 600 -h 600

## runChapterScript(17)

##       user    system   elapsed 
##  492217.97  31824.96  39801.06
NULL
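The user/system/elapsed numbers quoted in the comment above come from `proc.time()`, which the script prints on exit. A small sketch of how those figures are produced (the toy computation here is an arbitrary stand-in, not a chapter model fit):

```r
## system.time() wraps a single expression; proc.time() reports cumulative
## CPU and wall-clock seconds for the whole session.
tm <- system.time({
  x <- replicate(50, mean(rnorm(1e4)))   # stand-in workload
})
tm["elapsed"]   # wall-clock seconds for just this block
proc.time()     # cumulative user/system/elapsed for the session
```

Note that in the chapter-17 run the user time (492218 s) far exceeds the elapsed time (39801 s): the model fits ran in parallel across cores (via doMC, listed in the session info), so CPU seconds accumulate faster than wall-clock seconds.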
In [90]:
%%R

showChapterScript(18)
NULL
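Chapter 18 measures predictor importance by scoring each predictor separately against the outcome (correlations, LOESS fits, t-tests, MIC, Relief, ROC curves). Before the full output below, here is a minimal base-R sketch of that one-predictor-at-a-time filter approach, using the built-in iris data restricted to two classes instead of the chapter's solubility and segmentation data:

```r
## Score each numeric predictor against a binary class with a two-sample
## t-test, one predictor at a time -- the same pattern the chapter script
## uses via apply() on the segmentation data.
ir <- subset(iris, Species != "setosa")
ir$Species <- factor(ir$Species)            # drop the unused level

scores <- apply(ir[, 1:4], 2, function(x) {
  tt <- t.test(x ~ ir$Species)
  c(t = unname(tt$statistic), p = tt$p.value)
})
t(scores)   # one row per predictor: t-statistic and p-value
```

Large absolute t-statistics (small p-values) flag predictors whose class means differ most; caret's `filterVarImp()` plays the same per-predictor role with ROC AUC or LOESS R-squared as the score.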
In [85]:
%%R

showChapterOutput(18)
R Information
R version 3.0.1 (2013-05-16) -- "Good Sport"
Copyright (C) 2013 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin10.8.0 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> ################################################################################
> ### R code from Applied Predictive Modeling (2013) by Kuhn and Johnson.
> ### Copyright 2013 Kuhn and Johnson
> ### Web Page: http://www.appliedpredictivemodeling.com
> ### Contact: Max Kuhn (mxkuhn@gmail.com) 
> ###
> ### Chapter 18: Measuring Predictor Importance
> ###
> ### Required packages: AppliedPredictiveModeling, caret, CORElearn, corrplot,
> ###                    pROC, minerva
> ###                   
> ###
> ### Data used: The solubility data from the AppliedPredictiveModeling 
> ###            package, the segmentation data in the caret package and the 
> ###            grant data (created using "CreateGrantData.R" in the same
> ###            directory as this file).
> ###
> ### Notes: 
> ### 1) This code is provided without warranty.
> ###
> ### 2) This code should help the user reproduce the results in the
> ### text. There will be differences between this code and what is in
> ### the computing section. For example, the computing sections show
> ### how the source functions work (e.g. randomForest() or plsr()),
> ### which were not directly used when creating the book. Also, there may be 
> ### syntax differences that occur over time as packages evolve. These files 
> ### will reflect those changes.
> ###
> ### 3) In some cases, the calculations in the book were run in 
> ### parallel. The sub-processes may reset the random number seed.
> ### Your results may vary slightly.
> ###
> ################################################################################
> 
> 
> 
> ################################################################################
> ### Section 18.1 Numeric Outcomes
> 
> ## Load the solubility data
> 
> library(AppliedPredictiveModeling)
> data(solubility)
> 
> trainData <- solTrainXtrans
> trainData$y <- solTrainY
> 
> 
> ## keep the continuous predictors and append the outcome to the data frame
> SolContPred <- solTrainXtrans[, !grepl("FP", names(solTrainXtrans))]
> numSolPred <- ncol(SolContPred)
> SolContPred$Sol <- solTrainY
> 
> ## Get the LOESS smoother and the summary measure
> library(caret)
Loading required package: lattice
Loading required package: ggplot2
> smoother <- filterVarImp(x = SolContPred[, -ncol(SolContPred)], 
+                          y = solTrainY, 
+                          nonpara = TRUE)
Loading required package: pROC
Loading required package: plyr
Type 'citation("pROC")' for a citation.

Attaching package: ‘pROC’

The following objects are masked from ‘package:stats’:

    cov, smooth, var

> smoother$Predictor <- rownames(smoother)
> names(smoother)[1] <- "Smoother"
> 
> ## Calculate the correlation matrices and keep the columns with the correlations
> ## between the predictors and the outcome
> 
> correlations <- cor(SolContPred)[-(numSolPred+1),(numSolPred+1)]
> rankCorrelations <- cor(SolContPred, method = "spearman")[-(numSolPred+1),(numSolPred+1)]
> corrs <- data.frame(Predictor = names(SolContPred)[1:numSolPred],
+                     Correlation = correlations,
+                     RankCorrelation  = rankCorrelations)
> 
> ## The maximal information coefficient (MIC) values can be obtained from the
> ### minerva package:
> 
> library(minerva)
> MIC <- mine(x = SolContPred[, 1:numSolPred], y = solTrainY)$MIC
> MIC <- data.frame(Predictor = rownames(MIC),
+                   MIC = MIC[,1])
> 
> 
> ## The Relief values for regression can be computed using the CORElearn
> ## package:
> 
> library(CORElearn)
Loading required package: cluster
Loading required package: rpart
> ReliefF <- attrEval(Sol ~ .,  data = SolContPred,
+                     estimator = "RReliefFequalK")
> ReliefF <- data.frame(Predictor = names(ReliefF),
+                   Relief = ReliefF)
> 
> ## Combine them all together for a plot
> contDescrScores <- merge(smoother, corrs)
> contDescrScores <- merge(contDescrScores, MIC)
> contDescrScores <- merge(contDescrScores, ReliefF)
> 
> rownames(contDescrScores) <- contDescrScores$Predictor
> 
> contDescrScores
                          Predictor    Smoother Correlation RankCorrelation
HydrophilicFactor HydrophilicFactor 0.184455208  0.38598321      0.36469127
MolWeight                 MolWeight 0.444393085 -0.65852844     -0.68529880
NumAromaticBonds   NumAromaticBonds 0.168645461 -0.41066466     -0.45787109
NumAtoms                   NumAtoms 0.189931478 -0.43581129     -0.51983173
NumBonds                   NumBonds 0.210717251 -0.45903949     -0.54839850
NumCarbon                 NumCarbon 0.368196173 -0.60679170     -0.67359114
NumChlorine             NumChlorine 0.158529031 -0.39815704     -0.35707519
NumDblBonds             NumDblBonds 0.002409996  0.04909171     -0.02042731
NumHalogen               NumHalogen 0.157187646 -0.39646897     -0.38111965
NumHydrogen             NumHydrogen 0.022654223 -0.15051320     -0.25592586
NumMultBonds           NumMultBonds 0.230799468 -0.48041593     -0.47971353
NumNitrogen             NumNitrogen 0.026032871  0.16134705      0.10078218
NumNonHAtoms           NumNonHAtoms 0.340616555 -0.58362364     -0.62965400
NumNonHBonds           NumNonHBonds 0.342455243 -0.58519676     -0.63228366
NumOxygen                 NumOxygen 0.045245139  0.21270905      0.14954994
NumRings                   NumRings 0.231183499 -0.48081545     -0.50941815
NumRotBonds             NumRotBonds 0.013147325 -0.11466178     -0.14976036
NumSulfer                 NumSulfer 0.005865198 -0.07658458     -0.12090249
SurfaceArea1           SurfaceArea1 0.192535120  0.30325216      0.19339720
SurfaceArea2           SurfaceArea2 0.216936613  0.26663995      0.14057885
                        MIC      Relief
HydrophilicFactor 0.3208456 0.140185965
MolWeight         0.4679277 0.084734907
NumAromaticBonds  0.2705170 0.050013692
NumAtoms          0.2896815 0.008618179
NumBonds          0.3268683 0.002422405
NumCarbon         0.4434121 0.061605610
NumChlorine       0.2011708 0.023813283
NumDblBonds       0.1688472 0.056997492
NumHalogen        0.2017841 0.045002621
NumHydrogen       0.1939521 0.075626122
NumMultBonds      0.2792600 0.051554380
NumNitrogen       0.1535738 0.168280773
NumNonHAtoms      0.3947092 0.036433860
NumNonHBonds      0.3919627 0.035619406
NumOxygen         0.1527421 0.123797003
NumRings          0.3161828 0.056263469
NumRotBonds       0.1754215 0.043556286
NumSulfer         0.1297052 0.062359034
SurfaceArea1      0.2054896 0.120727945
SurfaceArea2      0.2274047 0.117632188
> 
> contDescrSplomData <- contDescrScores
> contDescrSplomData$Correlation <- abs(contDescrSplomData$Correlation)
> contDescrSplomData$RankCorrelation <- abs(contDescrSplomData$RankCorrelation)
> contDescrSplomData$Group <- "Other"
> contDescrSplomData$Group[grepl("Surface", contDescrSplomData$Predictor)] <- "SA"
> 
> featurePlot(solTrainXtrans[, c("NumCarbon", "SurfaceArea2")],
+             solTrainY,
+             between = list(x = 1),
+             type = c("g", "p", "smooth"),
+             df = 3,
+             aspect = 1,
+             labels = c("", "Solubility"))
> 
> 
> splom(~contDescrSplomData[,c(3, 4, 2, 5)],
+       groups = contDescrSplomData$Group,
+       varnames = c("Correlation", "Rank\nCorrelation", "LOESS", "MIC"))
> 
> 
> ## Now look at the categorical (i.e. binary) predictors
> SolCatPred <- solTrainXtrans[, grepl("FP", names(solTrainXtrans))]
> SolCatPred$Sol <- solTrainY
> numSolCatPred <- ncol(SolCatPred) - 1
> 
> tests <- apply(SolCatPred[, 1:numSolCatPred], 2,
+                   function(x, y)
+                     {
+                     tStats <- t.test(y ~ x)[c("statistic", "p.value", "estimate")]
+                     unlist(tStats)
+                     },
+                y = solTrainY)
> ## The results are a matrix with predictors in columns. We reverse this
> tests <- as.data.frame(t(tests))
> names(tests) <- c("t.Statistic", "t.test_p.value", "mean0", "mean1")
> tests$difference <- tests$mean1 - tests$mean0
> tests
      t.Statistic t.test_p.value     mean0     mean1   difference
FP001 -4.02204024   6.287404e-05 -2.978465 -2.451471  0.526993515
FP002 10.28672686   1.351580e-23 -2.021347 -3.313860 -1.292512617
FP003 -2.03644225   4.198619e-02 -2.832164 -2.571855  0.260308757
FP004 -4.94895770   9.551772e-07 -3.128380 -2.427428  0.700951689
FP005 10.28247538   1.576549e-23 -1.969000 -3.262722 -1.293722323
FP006 -7.87583806   9.287835e-15 -3.109421 -2.133832  0.975589032
FP007 -0.88733923   3.751398e-01 -2.759967 -2.646185  0.113781971
FP008  3.32843788   9.119521e-04 -2.582652 -2.999613 -0.416960797
FP009 11.49360533   7.467714e-27 -2.249591 -3.926278 -1.676686955
FP010 -4.11392307   4.973603e-05 -2.824302 -2.232824  0.591478647
FP011 -7.01680213   1.067782e-11 -2.934645 -1.927353  1.007292306
FP012 -1.89255407   5.953582e-02 -2.773755 -2.461369  0.312385742
FP013 11.73267872   1.088092e-24 -2.365485 -4.490696 -2.125210704
FP014 11.47456176   1.157457e-23 -2.375401 -4.508431 -2.133030370
FP015 -7.73718733   1.432769e-12 -4.404286 -2.444487  1.959799162
FP016 -0.61719794   5.377695e-01 -2.733559 -2.631007  0.102551919
FP017  2.73915987   6.681864e-03 -2.654607 -3.098613 -0.444006259
FP018  4.26743510   2.806561e-05 -2.643402 -3.215280 -0.571878063
FP019 -2.31045847   2.207143e-02 -2.766910 -2.370603  0.396306731
FP020 -3.44119896   7.251032e-04 -2.785806 -2.224912  0.560894171
FP021  3.35165112   1.009498e-03 -2.642392 -3.272348 -0.629955482
FP022 -0.66772403   5.051252e-01 -2.728040 -2.637071  0.090969199
FP023  2.18958532   2.989162e-02 -2.673106 -3.042650 -0.369544057
FP024 -2.43189276   1.617811e-02 -2.766457 -2.340841  0.425616224
FP025 -2.68651403   7.981132e-03 -2.771677 -2.312545  0.459131121
FP026  0.58596455   5.591541e-01 -2.709082 -2.821875 -0.112793485
FP027 -4.46177875   1.714807e-05 -2.793800 -2.024516  0.769283405
FP028 -3.36478123   1.011310e-03 -2.791941 -2.101089  0.690852068
FP029  1.50309317   1.346711e-01 -2.696475 -2.913093 -0.216617374
FP030 -4.18564626   5.684141e-05 -2.799582 -1.933933  0.865649782
FP031 -0.19030898   8.494207e-01 -2.721986 -2.683765  0.038221437
FP032 -2.86824205   5.100440e-03 -2.757832 -2.224429  0.533403438
FP033 -2.48343886   1.492327e-02 -2.751062 -2.282879  0.468183359
FP034  0.81786492   4.147985e-01 -2.709737 -2.820263 -0.110526015
FP035  4.17698556   6.851675e-05 -2.659660 -3.471594 -0.811934339
FP036 -5.31186085   6.344823e-07 -2.787224 -1.880417  0.906807452
FP037  1.37213471   1.734895e-01 -2.700271 -2.960000 -0.259728507
FP038 -2.55044552   1.224045e-02 -2.764833 -2.228293  0.536540459
FP039  6.83856010   1.396591e-09 -2.588330 -4.332817 -1.744487356
FP040 -4.96957478   3.640553e-06 -2.788036 -1.771692  1.016343810
FP041  3.86443922   2.274448e-04 -2.672424 -3.403833 -0.731409091
FP042 -1.10149897   2.742144e-01 -2.729509 -2.536852  0.192657624
FP043 -0.18525729   8.535189e-01 -2.721284 -2.680317  0.040966323
FP044 15.19844350   1.458342e-22 -2.472237 -6.582105 -4.109868127
FP045  3.26197779   1.781037e-03 -2.678118 -3.403962 -0.725844224
FP046  7.19096539   1.949765e-12 -2.405146 -3.398700 -0.993554071
FP047  3.08813847   2.106659e-03 -2.611605 -3.013676 -0.402071305
FP048  0.78156187   4.354510e-01 -2.703337 -2.826102 -0.122764360
FP049  9.32620107   1.541509e-16 -2.494036 -4.334828 -1.840791658
FP050  1.78989997   7.537562e-02 -2.684810 -2.984860 -0.300049387
FP051  3.85923300   1.590148e-04 -2.656482 -3.224231 -0.567749069
FP052 -1.37622794   1.707261e-01 -2.736296 -2.542529  0.193767561
FP053  7.79872544   3.863769e-12 -2.565418 -4.201910 -1.636492479
FP054  4.71268264   7.815108e-06 -2.656678 -3.474167 -0.817488623
FP055 -2.15047129   3.539774e-02 -2.743122 -2.285294  0.457828105
FP056  6.56517336   8.289424e-09 -2.598841 -4.435323 -1.836481186
FP057  1.55970276   1.207241e-01 -2.686667 -2.952807 -0.266140351
FP058  1.31266618   1.913070e-01 -2.691483 -2.930000 -0.238517200
FP059  5.30327181   1.388228e-06 -2.662258 -3.692115 -1.029857320
FP060 -6.34967826   3.396521e-10 -3.112819 -2.294192  0.818627333
FP061 -3.23528852   1.258017e-03 -2.903859 -2.489247  0.414612257
FP062 -4.68040368   3.284921e-06 -2.978056 -2.384856  0.593200306
FP063 -5.90647947   4.865776e-09 -3.037509 -2.288593  0.748916565
FP064 -3.19849081   1.427257e-03 -2.887640 -2.481616  0.406023478
FP065 13.67947483   7.369864e-39 -1.740827 -3.389468 -1.648641212
FP066 -3.50425986   4.936856e-04 -3.034043 -2.516776  0.517267265
FP067 -3.71025855   2.192910e-04 -2.894797 -2.430554  0.464242594
FP068 -4.50468714   7.534223e-06 -2.923921 -2.356221  0.567699992
FP069 -1.39582672   1.631126e-01 -2.782438 -2.605872  0.176566128
FP070 11.33500604   6.532630e-27 -2.155840 -3.739142 -1.583301881
FP071  9.16039412   1.012284e-18 -2.295828 -3.588521 -1.292692775
FP072 -9.86673490   4.502526e-21 -3.674277 -2.222396  1.451880757
FP073 -6.31556184   4.773987e-10 -2.972104 -2.154780  0.817323998
FP074 -3.16365915   1.617158e-03 -2.849299 -2.446958  0.402341137
FP075 -4.83159241   1.618286e-06 -2.926916 -2.311584  0.615331888
FP076 18.19671006   2.170836e-57 -1.949953 -4.292756 -2.342803359
FP077 -0.24434665   8.070283e-01 -2.728715 -2.697082  0.031633203
FP078 -0.49694487   6.193690e-01 -2.737523 -2.675156  0.062366949
FP079 12.46647477   2.609452e-32 -1.649763 -3.199207 -1.549444605
FP080 -4.44534892   1.029202e-05 -2.896848 -2.308160  0.588687940
FP081  0.11125946   9.114457e-01 -2.714519 -2.729057 -0.014537653
FP082 12.55490234   3.329065e-32 -1.573824 -3.177143 -1.603319328
FP083 -6.28835488   5.760827e-10 -2.932735 -2.149385  0.783350551
FP084 -3.43524930   6.332047e-04 -2.851414 -2.386949  0.464465314
FP085 10.47209331   1.134762e-22 -2.307585 -3.916008 -1.608423485
FP086  1.02088695   3.077271e-01 -2.682101 -2.817578 -0.135477406
FP087 11.07193302   5.850147e-26 -1.684808 -3.107540 -1.422732105
FP088 -4.82078133   1.873320e-06 -2.891398 -2.233960  0.657438003
FP089 15.68684642   7.559612e-42 -2.131606 -4.506936 -2.375330025
FP090  0.72850761   4.666345e-01 -2.693950 -2.792743 -0.098793036
FP091 -1.97821299   4.847758e-02 -2.777626 -2.515187  0.262438593
FP092 12.71461669   9.160201e-31 -2.250250 -4.169957 -1.919706549
FP093  2.40580805   1.652056e-02 -2.636787 -2.972026 -0.335238658
FP094 -1.08529331   2.783195e-01 -2.751874 -2.607909  0.143965054
FP095 -4.83150303   1.885749e-06 -2.863571 -2.203780  0.659791524
FP096 -0.05816460   9.536450e-01 -2.720323 -2.712271  0.008052049
FP097  9.06740092   4.508890e-18 -2.420977 -3.684420 -1.263443027
FP098 -3.09495737   2.088014e-03 -2.820538 -2.391460  0.429077754
FP099  4.51553294   8.153915e-06 -2.575959 -3.203843 -0.627883409
FP100 -4.26730797   2.354655e-05 -2.846430 -2.293727  0.552702276
FP101 -3.33565277   9.211008e-04 -2.828760 -2.363022  0.465738108
FP102  1.25032500   2.119440e-01 -2.683373 -2.857708 -0.174335474
FP103  2.51185846   1.236590e-02 -2.644038 -2.984808 -0.340770007
FP104  1.23433987   2.176989e-01 -2.681746 -2.846934 -0.165188360
FP105  2.56644125   1.063908e-02 -2.640201 -3.003756 -0.363555025
FP106  2.42187970   1.595574e-02 -2.652367 -2.998297 -0.345929993
FP107 10.92623859   2.395320e-23 -2.328707 -4.173284 -1.844576915
FP108 -0.88386799   3.773218e-01 -2.744087 -2.619641  0.124446276
FP109  1.72666429   8.493856e-02 -2.681392 -2.891845 -0.210453156
FP110 -4.30633122   2.083157e-05 -2.839272 -2.253622  0.585649074
FP111  0.07891212   9.371465e-01 -2.716361 -2.727594 -0.011232326
FP112 13.31169435   4.090297e-31 -2.293512 -4.478541 -2.185028791
FP113 -4.25438885   2.743420e-05 -2.842824 -2.207527  0.635296648
FP114  0.38442341   7.009005e-01 -2.711034 -2.759459 -0.048425836
FP115 -0.49398272   6.216320e-01 -2.730653 -2.663059  0.067594185
FP116 -3.39726200   7.657795e-04 -2.815911 -2.310055  0.505856814
FP117  3.16005628   1.769096e-03 -2.623060 -3.157353 -0.534292762
FP118 -3.88255786   1.272871e-04 -2.835755 -2.226776  0.608979252
FP119 -0.71996857   4.720764e-01 -2.734485 -2.636839  0.097646215
FP120 -3.25854728   1.280523e-03 -2.807793 -2.270759  0.537033697
FP121  0.62156119   5.349141e-01 -2.704487 -2.805188 -0.100701417
FP122 -2.44169102   1.530759e-02 -2.781836 -2.396154  0.385682632
FP123  3.52755166   4.929055e-04 -2.628914 -3.165157 -0.536243091
FP124 -3.58983366   3.953044e-04 -2.806888 -2.261494  0.545394825
FP125 -2.91655379   3.853055e-03 -2.786364 -2.350743  0.435620393
FP126 -1.44180023   1.505173e-01 -2.748395 -2.547234  0.201161019
FP127 -2.66597987   8.213408e-03 -2.773386 -2.381429  0.391957737
FP128 -3.37747584   8.536233e-04 -2.794086 -2.284752  0.509334647
FP129  3.28855844   1.192299e-03 -2.642100 -3.193030 -0.550930181
FP130  1.02990587   3.048783e-01 -2.698555 -2.888900 -0.190345358
FP131 -0.49682548   6.198471e-01 -2.727954 -2.653583  0.074370939
FP132 -5.89680424   1.633112e-08 -2.832055 -1.925126  0.906929238
FP133 -1.83896087   6.756107e-02 -2.757100 -2.451750  0.305349880
FP134  3.16620016   1.761695e-03 -2.661506 -3.110000 -0.448493976
FP135 -2.94236705   3.709259e-03 -2.783827 -2.266667  0.517160048
FP136 -2.02006233   4.501990e-02 -2.761938 -2.403304  0.358633451
FP137 -0.07855180   9.374873e-01 -2.720131 -2.706636  0.013494433
FP138 -1.44829927   1.496787e-01 -2.748083 -2.483302  0.264780953
FP139 -0.22212826   8.246439e-01 -2.721936 -2.680897  0.041038417
FP140 -1.86990507   6.355486e-02 -2.758036 -2.403962  0.354073239
FP141  4.15441700   4.792655e-05 -2.650655 -3.232523 -0.581867761
FP142 -2.92307611   4.047862e-03 -2.779233 -2.224519  0.554713355
FP143  0.83414756   4.061300e-01 -2.705904 -2.862338 -0.156433772
FP144 -4.98991305   1.904653e-06 -2.819214 -1.852424  0.966789373
FP145 -3.99831545   1.002597e-04 -2.787077 -2.128990  0.658087566
FP146  6.08904552   1.064009e-08 -2.608687 -3.675000 -1.066313013
FP147 -2.98364059   3.376138e-03 -2.776357 -2.226800  0.549557227
FP148 -4.00444775   1.101041e-04 -2.780300 -2.073012  0.707287491
FP149  9.67498002   8.530838e-16 -2.479225 -5.125930 -2.646704799
FP150 -1.59224059   1.145443e-01 -2.742808 -2.435467  0.307341553
FP151 -1.68674372   9.608846e-02 -2.736013 -2.423019  0.312994495
FP152  2.02103329   4.549820e-02 -2.692325 -3.012308 -0.319982377
FP153  0.83775227   4.044086e-01 -2.703900 -2.892432 -0.188532775
FP154 -0.18701160   8.526043e-01 -2.720525 -2.668889  0.051635701
FP155  4.93743429   3.813516e-06 -2.653412 -3.592273 -0.938860298
FP156  2.70254904   8.178498e-03 -2.685045 -3.160896 -0.475850274
FP157 -1.19798365   2.351567e-01 -2.738105 -2.423220  0.314885042
FP158 -3.18371959   2.293303e-03 -2.757078 -2.039020  0.718058170
FP159  2.90626659   4.444806e-03 -2.687590 -3.127313 -0.439722935
FP160  0.72930617   4.673596e-01 -2.711400 -2.816308 -0.104908144
FP161 -8.02084404   8.158474e-12 -2.826779 -1.193333  1.633445946
FP162  9.05654884   7.502729e-19 -2.147208 -3.300849 -1.153640924
FP163 -4.73411111   2.565152e-06 -3.009759 -2.398455  0.611304290
FP164 11.15556043   6.131703e-27 -1.830706 -3.245042 -1.414335661
FP165 -3.26163144   1.150990e-03 -2.862294 -2.450602  0.411691613
FP166  6.01599552   3.059094e-09 -2.441541 -3.277905 -0.836363881
FP167 -3.77468033   1.718080e-04 -2.874742 -2.398718  0.476023835
FP168 12.78784085   6.302482e-34 -1.659686 -3.250521 -1.590835792
FP169 10.79840624   1.952902e-22 -2.370413 -4.241017 -1.870603512
FP170  1.45059296   1.480425e-01 -2.674961 -2.911943 -0.236981517
FP171 -3.56151646   4.354270e-04 -2.810722 -2.266398  0.544324003
FP172 13.04070659   8.112523e-28 -2.345390 -4.809931 -2.464540221
FP173  2.68918003   7.770466e-03 -2.653554 -3.111556 -0.458001634
FP174  0.94721964   3.446525e-01 -2.699492 -2.845806 -0.146314311
FP175  0.01020115   9.918704e-01 -2.718360 -2.719922 -0.001562215
FP176 -2.29447613   2.298911e-02 -2.766395 -2.374310  0.392084865
FP177 -1.08253877   2.802959e-01 -2.737548 -2.580609  0.156939151
FP178  3.27582610   1.258481e-03 -2.656782 -3.167739 -0.510956834
FP179  0.85670987   3.931634e-01 -2.703846 -2.854409 -0.150562448
FP180 -2.83913345   5.188161e-03 -2.773274 -2.263235  0.510039146
FP181  6.24259165   6.005980e-09 -2.617726 -3.695281 -1.077554681
FP182 -2.11887211   3.595632e-02 -2.755239 -2.384255  0.370983887
FP183 -2.62186301   1.015591e-02 -2.755210 -2.271250  0.483960466
FP184 10.24979020   9.572172e-17 -2.493318 -5.171000 -2.677681975
FP185  3.21519455   1.718715e-03 -2.667230 -3.270000 -0.602770115
FP186 -2.10893733   3.756740e-02 -2.749818 -2.342740  0.407078042
FP187 -0.14233858   8.871705e-01 -2.721122 -2.685942  0.035180420
FP188 -2.76497219   7.083803e-03 -2.760011 -2.153692  0.606318979
FP189  0.29230393   7.707177e-01 -2.713884 -2.774932 -0.061047680
FP190  8.23796541   2.799252e-12 -2.574785 -4.556522 -1.981737159
FP191 -1.62000293   1.089976e-01 -2.742364 -2.404627  0.337737388
FP192  0.55100083   5.833593e-01 -2.711377 -2.829310 -0.117932965
FP193 11.06173597   1.595927e-16 -2.525146 -5.642881 -3.117735616
FP194 -1.03294441   3.047671e-01 -2.728916 -2.553214  0.175701915
FP195 -5.88072667   1.035398e-07 -2.786495 -1.672759  1.113736340
FP196  6.42707826   1.269199e-08 -2.651126 -3.838889 -1.187762913
FP197  3.82944792   3.167065e-04 -2.670555 -3.583800 -0.913245061
FP198 -3.87872401   2.598433e-04 -2.776165 -1.761852  1.014313143
FP199  0.59118217   5.569865e-01 -2.711578 -2.859333 -0.147754967
FP200  5.15622561   3.020793e-06 -2.668319 -3.685106 -1.016787799
FP201 -3.92629512   2.100852e-04 -2.757414 -2.018600  0.738813984
FP202  5.92935333   6.082278e-09 -2.496969 -3.357143 -0.860174019
FP203  1.09341446   2.759667e-01 -2.695582 -2.896147 -0.200564841
FP204  2.86078975   4.868444e-03 -2.672159 -3.141702 -0.469543435
FP205  5.61427744   2.488511e-07 -2.605564 -4.057838 -1.452273414
FP206  3.58353985   6.162975e-04 -2.674519 -3.409474 -0.734954669
FP207  8.34894566   1.153650e-11 -2.595151 -4.768704 -2.173553202
FP208  1.37823055   1.702203e-01 -2.690237 -2.942056 -0.251819108
> 
> ## Create a volcano plot
> 
> xyplot(-log10(t.test_p.value) ~ difference,
+        data = tests,
+        xlab = "Mean With Structure - Mean Without Structure",
+        ylab = "-log(p-Value)",
+        type = "p")
> 
> ################################################################################
> ### Section 18.2 Categorical Outcomes
> 
> ## Load the segmentation data
> 
> data(segmentationData)
> segTrain <- subset(segmentationData, Case == "Train")
> segTrain$Case <- segTrain$Cell <- NULL
> 
> segTest <- subset(segmentationData, Case != "Train")
> segTest$Case <- segTest$Cell <- NULL
> 
> ## Compute the areas under the ROC curve
> aucVals <- filterVarImp(x = segTrain[, -1], y = segTrain$Class)
> aucVals$Predictor <- rownames(aucVals)
> 
> ## Calculate the t-tests as before but with x and y switched
> segTests <- apply(segTrain[, -1], 2,
+                   function(x, y)
+                     {
+                     tStats <- t.test(x ~ y)[c("statistic", "p.value", "estimate")]
+                     unlist(tStats)
+                     },
+                y = segTrain$Class)
> segTests <- as.data.frame(t(segTests))
> names(segTests) <- c("t.Statistic", "t.test_p.value", "mean0", "mean1")
> segTests$Predictor <- rownames(segTests)
> 
> ## Fit a random forest model and get the importance scores
> library(randomForest)
randomForest 4.6-7
Type rfNews() to see new features/changes/bug fixes.
> set.seed(791)
> rfImp <- randomForest(Class ~ ., data = segTrain, 
+                       ntree = 2000, 
+                       importance = TRUE)
> rfValues <- data.frame(RF = importance(rfImp)[, "MeanDecreaseGini"],
+                        Predictor = rownames(importance(rfImp)))
> 
> ## Now compute the Relief scores
> set.seed(791)
> 
> ReliefValues <- attrEval(Class ~ ., data = segTrain,
+                          estimator="ReliefFequalK", ReliefIterations = 50)
> ReliefValues <- data.frame(Relief = ReliefValues,
+                            Predictor = names(ReliefValues))
> 
> ## and the MIC statistics
> set.seed(791)
> segMIC <- mine(x = segTrain[, -1],
+                ## Pass the outcome as 0/1
+                y = ifelse(segTrain$Class == "PS", 1, 0))$MIC
> segMIC <- data.frame(Predictor = rownames(segMIC),
+                   MIC = segMIC[,1])
> 
> 
> rankings <- merge(segMIC, ReliefValues)
> rankings <- merge(rankings, rfValues)
> rankings <- merge(rankings, segTests)
> rankings <- merge(rankings, aucVals)
> rankings
                 Predictor         MIC       Relief        RF  t.Statistic
1                 AngleCh1 0.131057008  0.002287557  4.730963  -0.21869850
2                  AreaCh1 0.108083908  0.016041257  4.315317  -0.93160658
3              AvgIntenCh1 0.292046076  0.071057681 18.865802 -11.75400848
4              AvgIntenCh2 0.329484594  0.150684824 21.857848 -16.09400822
5              AvgIntenCh3 0.135443794  0.018172519  5.135363  -0.14752973
6              AvgIntenCh4 0.166545039 -0.007167866  5.434737  -6.23725001
7   ConvexHullAreaRatioCh1 0.299627157  0.035983697 19.093048  14.22756193
8  ConvexHullPerimRatioCh1 0.254931744  0.041865999 12.624038 -13.86697029
9      DiffIntenDensityCh1 0.239224382  0.038582763  7.335741  -9.81721615
10     DiffIntenDensityCh3 0.133084659  0.010830941  6.647198   1.48785690
11     DiffIntenDensityCh4 0.147643832  0.042352546  5.386981  -5.54840221
12         EntropyIntenCh1 0.261097110  0.129280729 13.867582 -14.04326173
13         EntropyIntenCh3 0.172122729  0.039687246  5.127465   6.94689541
14         EntropyIntenCh4 0.185625627  0.021260676  5.742739  -9.03621024
15           EqCircDiamCh1 0.108083908  0.038820971  4.185607  -1.85186912
16         EqEllipseLWRCh1 0.212579943  0.016550609  5.708705   9.83868863
17   EqEllipseOblateVolCh1 0.122276159  0.010367074  3.906543   1.35616134
18  EqEllipseProlateVolCh1 0.169674904 -0.005386670  6.018121  -1.29243801
19         EqSphereAreaCh1 0.108083908  0.016110539  4.183567  -0.93273061
20          EqSphereVolCh1 0.108083908  0.003440003  4.133475  -0.04348657
21          FiberAlign2Ch3 0.177116842 -0.002628403  4.373886   3.65095007
22          FiberAlign2Ch4 0.149937844  0.016047962  4.868552   2.07009183
23          FiberLengthCh1 0.220505513  0.050610471  8.368712   9.26429955
24           FiberWidthCh1 0.368720274  0.107691201 33.371913 -18.96852051
25         IntenCoocASMCh3 0.196466490  0.024738010  7.298595  -7.95107008
26         IntenCoocASMCh4 0.147981004  0.005574684  3.734085   4.51016239
27    IntenCoocContrastCh3 0.231500707  0.021282305  8.438533  13.20540372
28    IntenCoocContrastCh4 0.135150335 -0.002605380  4.567712   1.02551789
29     IntenCoocEntropyCh3 0.202905819  0.039769279  6.354566   9.62738946
30     IntenCoocEntropyCh4 0.148928924  0.042214966  4.234247  -5.73801017
31         IntenCoocMaxCh3 0.193078547  0.039834486  6.865277 -10.01109754
32         IntenCoocMaxCh4 0.152580596  0.064488810  3.966995   5.02868895
33            KurtIntenCh1 0.200874103  0.003243188  7.095402   3.18226166
34            KurtIntenCh3 0.135694293  0.010944913  4.237905  -2.46783420
35            KurtIntenCh4 0.152775633  0.011328311  5.339427   4.39807449
36               LengthCh1 0.149378763  0.044483732  4.235474   5.28480181
37      NeighborAvgDistCh1 0.123412342  0.023330722  4.266566  -0.46614250
38      NeighborMinDistCh1 0.125623472  0.007850922  5.152365   0.80769702
39      NeighborVarDistCh1 0.124259322  0.016447793  4.286239   0.29886752
40                PerimCh1 0.170013515  0.025272254  4.115593   6.18542523
41             ShapeBFRCh1 0.235667275  0.005194794  9.782458 -13.25311412
42             ShapeLWRCh1 0.183599199  0.029568271  4.745873   8.40241429
43             ShapeP2ACh1 0.332238080  0.073795605 19.362332  14.75801555
44            SkewIntenCh1 0.259680600  0.085229983 13.628434   9.66411304
45            SkewIntenCh3 0.149153858  0.056669970  4.244103  -3.76453794
46            SkewIntenCh4 0.152202895  0.002508761  5.478398   6.46619794
47       SpotFiberCountCh3 0.005721744 -0.005692308  1.793200  -0.53238018
48       SpotFiberCountCh4 0.019496167 -0.015192982  2.948225   2.98634139
49           TotalIntenCh1 0.304429766  0.045548534 20.916993  -8.20041297
50           TotalIntenCh2 0.400952572  0.185416030 41.617068 -14.54087193
51           TotalIntenCh3 0.115771733  0.015068883  5.402005  -0.46828755
52           TotalIntenCh4 0.186643156  0.006071748  5.712561  -5.64791505
53             VarIntenCh1 0.241235863  0.045687478  9.259561 -10.40110966
54             VarIntenCh3 0.150238051  0.002815999  5.176123  -2.44172596
55             VarIntenCh4 0.171222193  0.001547820  5.981325  -4.83455579
56                WidthCh1 0.146204548  0.021560423  5.113884  -1.59227638
57               XCentroid 0.106662637 -0.037877551  4.220162   1.10633278
58               YCentroid 0.119516938  0.055209622  4.908536   2.19081435
   t.test_p.value        mean0        mean1        PS        WS
1    8.269443e-01 9.086539e+01 9.157148e+01 0.5025967 0.5025967
2    3.517830e-01 3.205519e+02 3.329249e+02 0.5709170 0.5709170
3    4.819837e-28 7.702212e+01 2.146922e+02 0.7662375 0.7662375
4    2.530403e-50 1.324405e+02 2.778397e+02 0.7866146 0.7866146
5    8.827553e-01 9.578766e+01 9.671147e+01 0.5214098 0.5214098
6    7.976250e-10 1.168287e+02 1.795797e+02 0.6473814 0.6473814
7    5.895088e-42 1.270408e+00 1.114054e+00 0.7815519 0.7815519
8    4.644231e-40 8.714806e-01 9.310403e-01 0.7547844 0.7547844
9    6.509740e-21 6.055821e+01 9.601373e+01 0.7161591 0.7161591
10   1.371842e-01 7.753072e+01 7.104993e+01 0.5427353 0.5427353
11   4.178896e-08 7.508542e+01 1.061125e+02 0.6294704 0.6294704
12   5.145995e-40 6.364841e+00 7.004622e+00 0.7565169 0.7565169
13   8.836060e-12 5.704662e+00 5.014508e+00 0.6340145 0.6340145
14   9.775620e-19 5.192365e+00 6.023039e+00 0.6661861 0.6661861
15   6.437960e-02 1.940093e+01 2.002646e+01 0.5709170 0.5709170
16   7.218411e-22 2.371177e+00 1.758240e+00 0.6965915 0.6965915
17   1.753561e-01 7.632288e+02 6.866693e+02 0.5045568 0.5045568
18   1.965213e-01 3.543481e+02 3.920429e+02 0.6301870 0.6301870
19   3.512025e-01 1.284179e+03 1.333731e+03 0.5709170 0.5709170
20   9.653226e-01 5.017110e+03 5.033648e+03 0.5709170 0.5709170
21   2.770065e-04 1.479185e+00 1.421565e+00 0.5690728 0.5690728
22   3.873106e-02 1.444148e+00 1.412867e+00 0.5421535 0.5421535
23   1.239044e-19 3.991835e+01 2.819142e+01 0.7007984 0.7007984
24   1.162284e-64 8.691444e+00 1.282684e+01 0.8355127 0.8355127
25   1.067683e-14 7.373161e-02 1.559897e-01 0.6956093 0.6956093
26   7.290850e-06 1.131789e-01 7.724074e-02 0.5878438 0.5878438
27   7.794899e-37 1.163875e+01 6.292079e+00 0.7214199 0.7214199
28   3.053656e-01 7.700191e+00 7.343397e+00 0.5358642 0.5358642
29   1.282007e-20 6.201308e+00 5.216667e+00 0.6891345 0.6891345
30   1.313352e-08 5.545934e+00 6.032306e+00 0.6073356 0.6073356
31   4.418432e-22 1.900393e-01 3.245564e-01 0.6944627 0.6944627
32   5.990072e-07 2.707207e-01 2.131262e-01 0.5892938 0.5892938
33   1.506054e-03 1.208829e+00 3.868323e-01 0.6711982 0.6711982
34   1.388162e-02 3.121647e+00 4.480168e+00 0.5513936 0.5513936
35   1.210957e-05 1.388322e+00 2.421078e-01 0.6046335 0.6046335
36   1.571520e-07 3.237304e+01 2.839838e+01 0.6015142 0.6015142
37   6.412508e-01 2.294382e+02 2.307292e+02 0.5047676 0.5047676
38   4.194740e-01 3.020875e+01 2.962558e+01 0.5018274 0.5018274
39   7.651196e-01 1.046047e+02 1.042038e+02 0.5072546 0.5072546
40   9.075622e-10 9.721959e+01 8.203652e+01 0.6200196 0.6200196
41   6.819382e-37 5.630603e-01 6.406694e-01 0.7319836 0.7319836
42   1.498789e-16 1.968091e+00 1.601640e+00 0.6607778 0.6607778
43   9.265729e-45 2.380621e+00 1.606325e+00 0.7930978 0.7930978
44   6.631564e-21 8.687084e-01 4.124373e-01 0.7253275 0.7253275
45   1.819323e-04 1.429871e+00 1.711829e+00 0.5732881 0.5732881
46   1.592246e-10 1.069003e+00 7.366442e-01 0.6193873 0.6193873
47   5.946089e-01 1.915094e+00 1.970509e+00 0.5173630 0.5173630
48   2.894728e-03 7.224843e+00 6.477212e+00 0.4619775 0.4619775
49   1.624963e-15 2.494150e+04 6.265354e+04 0.7895358 0.7895358
50   3.385024e-43 3.858694e+04 7.665351e+04 0.8012840 0.8012840
51   6.397155e-01 2.685926e+04 2.770986e+04 0.5094972 0.5094972
52   2.290183e-08 3.466429e+04 5.217025e+04 0.6599073 0.6599073
53   5.662429e-23 5.142099e+01 1.136596e+02 0.7322365 0.7322365
54   1.488950e-02 9.519852e+01 1.127093e+02 0.5330821 0.5330821
55   1.632212e-06 1.063653e+02 1.430475e+02 0.6322357 0.6322357
56   1.116486e-01 1.754162e+01 1.813792e+01 0.5799484 0.5799484
57   2.689098e-01 2.698852e+02 2.599759e+02 0.5216669 0.5216669
58   2.875168e-02 1.842972e+02 1.691475e+02 0.5407878 0.5407878
> 
> rankings$channel <- "Channel 1"
> rankings$channel[grepl("Ch2$", rankings$Predictor)] <- "Channel 2"
> rankings$channel[grepl("Ch3$", rankings$Predictor)] <- "Channel 3"
> rankings$channel[grepl("Ch4$", rankings$Predictor)] <- "Channel 4"
> rankings$t.Statistic <- abs(rankings$t.Statistic)
> 
> splom(~rankings[, c("PS", "t.Statistic", "RF", "Relief", "MIC")],
+       groups = rankings$channel,
+       varnames = c("ROC\nAUC", "Abs\nt-Stat", "Random\nForest", "Relief", "MIC"),
+       auto.key = list(columns = 2))
> 
> 
> ## Load the grant data. A script to create and save these data is contained
> ## in the same directory as this file.
> 
> load("grantData.RData")
> 
> dataSubset <- training[pre2008, c("Sponsor62B", "ContractValueBandUnk", "RFCD240302")]
> 
> ## This is a simple function to compute several statistics for binary predictors
> tableCalcs <- function(x, y)
+   {
+   tab <- table(x, y)
+   fet <- fisher.test(tab)
+   out <- c(OR = fet$estimate,
+            P = fet$p.value,
+            Gain = attrEval(y ~ x, estimator = "GainRatio"))
+   }
> 
> ## lapply() is used to execute the function on each column
> tableResults <- lapply(dataSubset, tableCalcs, y = training[pre2008, "Class"])
> 
> ## The results come back as a list of vectors, and "rbind" is used to join
> ## them together as rows of a table
> tableResults <- do.call("rbind", tableResults)
> tableResults
                     OR.odds ratio             P       Gain.x
Sponsor62B                6.040826  2.643795e-07 0.0472613504
ContractValueBandUnk      6.294236 1.718209e-263 0.1340764356
RFCD240302                1.097565  8.515664e-01 0.0001664263
> 
> ## The permuted Relief scores can be computed using a function from the
> ## AppliedPredictiveModeling package. 
> 
> permuted <- permuteRelief(x = training[pre2008, c("Sponsor62B", "Day", "NumCI")], 
+                           y = training[pre2008, "Class"],
+                           nperm = 500,
+                           ### the remaining options are passed to attrEval()
+                           estimator="ReliefFequalK", 
+                           ReliefIterations= 50)
> 
> ## The original Relief scores:
> permuted$observed
  Sponsor62B          Day        NumCI 
 0.000000000  0.036490637 -0.009047619 
> 
> ## The number of standard deviations away from the permuted mean:
> permuted$standardized
 Sponsor62B         Day       NumCI 
-0.08258544  4.50898453 -1.07569741 
> 
> ## The distributions of the scores if there were no relationship between the
> ## predictors and outcomes
> 
> histogram(~value|Predictor, 
+           data = permuted$permutations, 
+           xlim = extendrange(permuted$permutations$value),
+           xlab = "Relief Score")
> 
> 
> ################################################################################
> ### Session Information
> 
> sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] randomForest_4.6-7              CORElearn_0.9.41               
 [3] rpart_4.1-1                     cluster_1.14.4                 
 [5] minerva_1.3                     pROC_1.5.4                     
 [7] plyr_1.8                        caret_6.0-22                   
 [9] ggplot2_0.9.3.1                 lattice_0.20-15                
[11] AppliedPredictiveModeling_1.1-5

loaded via a namespace (and not attached):
 [1] car_2.0-16         codetools_0.2-8    colorspace_1.2-1   dichromat_2.0-0   
 [5] digest_0.6.3       foreach_1.4.0      grid_3.0.1         gtable_0.1.2      
 [9] iterators_1.0.6    labeling_0.1       MASS_7.3-26        munsell_0.4       
[13] parallel_3.0.1     proto_0.3-10       RColorBrewer_1.0-5 reshape2_1.2.2    
[17] scales_0.2.3       stringr_0.6.2      tools_3.0.1       
> 
> q("no")
> proc.time()
   user  system elapsed 
 78.161   0.635  79.081 
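The standardized score printed above is just the observed Relief value expressed in units of the permutation distribution's standard deviation. A minimal sketch, assuming the `permuted` object returned by `permuteRelief()` is still in scope:

```r
## Reproduce permuted$standardized for one predictor by hand:
## (observed score - mean of permuted scores) / sd of permuted scores
dayPerms <- subset(permuted$permutations, Predictor == "Day")$value
(permuted$observed["Day"] - mean(dayPerms)) / sd(dayPerms)
## should match permuted$standardized["Day"] (about 4.51 above)
```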
In [101]:
%%R -w 600 -h 600

runChapterScript(18)

##     user  system elapsed 
##   78.161   0.635  79.081
NULL
In [92]:
%%R

### Section 18.1 Numeric Outcomes

## Load the solubility data

library(AppliedPredictiveModeling)
data(solubility)

trainData <- solTrainXtrans
trainData$y <- solTrainY


## keep the continuous predictors and append the outcome to the data frame
SolContPred <- solTrainXtrans[, !grepl("FP", names(solTrainXtrans))]
numSolPred <- ncol(SolContPred)
SolContPred$Sol <- solTrainY

## Get the LOESS smoother and the summary measure
library(caret)
smoother <- filterVarImp(x = SolContPred[, -ncol(SolContPred)],
                         y = solTrainY,
                         nonpara = TRUE)
smoother$Predictor <- rownames(smoother)
names(smoother)[1] <- "Smoother"

## Calculate the correlation matrices and keep the columns with the correlations
## between the predictors and the outcome

correlations <- cor(SolContPred)[-(numSolPred+1),(numSolPred+1)]
rankCorrelations <- cor(SolContPred, method = "spearman")[-(numSolPred+1),(numSolPred+1)]
corrs <- data.frame(Predictor = names(SolContPred)[1:numSolPred],
                    Correlation = correlations,
                    RankCorrelation  = rankCorrelations)

## The maximal information coefficient (MIC) values can be obtained from the
### minerva package:

library(minerva)
MIC <- mine(x = SolContPred[, 1:numSolPred], y = solTrainY)$MIC
MIC <- data.frame(Predictor = rownames(MIC),
                  MIC = MIC[,1])


## The Relief values for regression can be computed using the CORElearn
## package:

library(CORElearn)
ReliefF <- attrEval(Sol ~ .,  data = SolContPred,
                    estimator = "RReliefFequalK")
ReliefF <- data.frame(Predictor = names(ReliefF),
                      Relief = ReliefF)

## Combine them all together for a plot
contDescrScores <- merge(smoother, corrs)
contDescrScores <- merge(contDescrScores, MIC)
contDescrScores <- merge(contDescrScores, ReliefF)

rownames(contDescrScores) <- contDescrScores$Predictor

print(
contDescrScores
)

contDescrSplomData <- contDescrScores
contDescrSplomData$Correlation <- abs(contDescrSplomData$Correlation)
contDescrSplomData$RankCorrelation <- abs(contDescrSplomData$RankCorrelation)
contDescrSplomData$Group <- "Other"
contDescrSplomData$Group[grepl("Surface", contDescrSplomData$Predictor)] <- "SA"
                          Predictor    Smoother Correlation RankCorrelation
HydrophilicFactor HydrophilicFactor 0.184455208  0.38598321      0.36469127
MolWeight                 MolWeight 0.444393085 -0.65852844     -0.68529880
NumAromaticBonds   NumAromaticBonds 0.168645461 -0.41066466     -0.45787109
NumAtoms                   NumAtoms 0.189931478 -0.43581129     -0.51983173
NumBonds                   NumBonds 0.210717251 -0.45903949     -0.54839850
NumCarbon                 NumCarbon 0.368196173 -0.60679170     -0.67359114
NumChlorine             NumChlorine 0.158529031 -0.39815704     -0.35707519
NumDblBonds             NumDblBonds 0.002409996  0.04909171     -0.02042731
NumHalogen               NumHalogen 0.157187646 -0.39646897     -0.38111965
NumHydrogen             NumHydrogen 0.022654223 -0.15051320     -0.25592586
NumMultBonds           NumMultBonds 0.230799468 -0.48041593     -0.47971353
NumNitrogen             NumNitrogen 0.026032871  0.16134705      0.10078218
NumNonHAtoms           NumNonHAtoms 0.340616555 -0.58362364     -0.62965400
NumNonHBonds           NumNonHBonds 0.342455243 -0.58519676     -0.63228366
NumOxygen                 NumOxygen 0.045245139  0.21270905      0.14954994
NumRings                   NumRings 0.231183499 -0.48081545     -0.50941815
NumRotBonds             NumRotBonds 0.013147325 -0.11466178     -0.14976036
NumSulfer                 NumSulfer 0.005865198 -0.07658458     -0.12090249
SurfaceArea1           SurfaceArea1 0.192535120  0.30325216      0.19339720
SurfaceArea2           SurfaceArea2 0.216936613  0.26663995      0.14057885
                        MIC      Relief
HydrophilicFactor 0.3208456 0.140185965
MolWeight         0.4679277 0.084734907
NumAromaticBonds  0.2705170 0.050013692
NumAtoms          0.2896815 0.008618179
NumBonds          0.3268683 0.002422405
NumCarbon         0.4434121 0.061605610
NumChlorine       0.2011708 0.023813283
NumDblBonds       0.1688472 0.056997492
NumHalogen        0.2017841 0.045002621
NumHydrogen       0.1939521 0.075626122
NumMultBonds      0.2792600 0.051554380
NumNitrogen       0.1535738 0.168280773
NumNonHAtoms      0.3947092 0.036433860
NumNonHBonds      0.3919627 0.035619406
NumOxygen         0.1527421 0.123797003
NumRings          0.3161828 0.056263469
NumRotBonds       0.1754215 0.043556286
NumSulfer         0.1297052 0.062359034
SurfaceArea1      0.2054896 0.120727945
SurfaceArea2      0.2274047 0.117632188
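The cell above collects several very different importance metrics for the same predictors. A small simulated illustration (hypothetical data, not from the solubility example) of why both rank correlation and MIC are worth computing: Spearman's correlation only detects monotone trends, while MIC also flags non-monotone relationships.

```r
## Hypothetical data, not part of the solubility example
library(minerva)
set.seed(1)
x <- seq(-2, 2, length.out = 200)
yMono <- x^3 + rnorm(200, sd = 0.1)   # nonlinear but monotone
yQuad <- x^2 + rnorm(200, sd = 0.1)   # nonlinear and non-monotone
cor(x, yMono, method = "spearman")    # near 1
cor(x, yQuad, method = "spearman")    # near 0
mine(x, yMono)$MIC                    # high
mine(x, yQuad)$MIC                    # also high: MIC sees the pattern
```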
In [93]:
%%R

print(
featurePlot(solTrainXtrans[, c("NumCarbon", "SurfaceArea2")],
            solTrainY,
            between = list(x = 1),
            type = c("g", "p", "smooth"),
            df = 3,
            aspect = 1,
            labels = c("", "Solubility"))
)

print(
splom(~contDescrSplomData[,c(3, 4, 2, 5)],
      groups = contDescrSplomData$Group,
      varnames = c("Correlation", "Rank\nCorrelation", "LOESS", "MIC"))
)
In [130]:
%%R

## Now look at the categorical (i.e. binary) predictors
SolCatPred <- solTrainXtrans[, grepl("FP", names(solTrainXtrans))]
SolCatPred$Sol <- solTrainY
numSolCatPred <- ncol(SolCatPred) - 1

tests <- apply(SolCatPred[, 1:numSolCatPred], 2,
                  function(x, y)
                    {
                    tStats <- t.test(y ~ x)[c("statistic", "p.value", "estimate")]
                    unlist(tStats)
                    },
               y = solTrainY)

## The results form a matrix with predictors in columns; transpose it so predictors are rows
tests <- as.data.frame(t(tests))
names(tests) <- c("t.Statistic", "t.test_p.value", "mean0", "mean1")
tests$difference <- tests$mean1 - tests$mean0

print(
tests
)
      t.Statistic t.test_p.value     mean0     mean1   difference
FP001 -4.02204024   6.287404e-05 -2.978465 -2.451471  0.526993515
FP002 10.28672686   1.351580e-23 -2.021347 -3.313860 -1.292512617
FP003 -2.03644225   4.198619e-02 -2.832164 -2.571855  0.260308757
FP004 -4.94895770   9.551772e-07 -3.128380 -2.427428  0.700951689
FP005 10.28247538   1.576549e-23 -1.969000 -3.262722 -1.293722323
FP006 -7.87583806   9.287835e-15 -3.109421 -2.133832  0.975589032
FP007 -0.88733923   3.751398e-01 -2.759967 -2.646185  0.113781971
FP008  3.32843788   9.119521e-04 -2.582652 -2.999613 -0.416960797
FP009 11.49360533   7.467714e-27 -2.249591 -3.926278 -1.676686955
FP010 -4.11392307   4.973603e-05 -2.824302 -2.232824  0.591478647
FP011 -7.01680213   1.067782e-11 -2.934645 -1.927353  1.007292306
FP012 -1.89255407   5.953582e-02 -2.773755 -2.461369  0.312385742
FP013 11.73267872   1.088092e-24 -2.365485 -4.490696 -2.125210704
FP014 11.47456176   1.157457e-23 -2.375401 -4.508431 -2.133030370
FP015 -7.73718733   1.432769e-12 -4.404286 -2.444487  1.959799162
FP016 -0.61719794   5.377695e-01 -2.733559 -2.631007  0.102551919
FP017  2.73915987   6.681864e-03 -2.654607 -3.098613 -0.444006259
FP018  4.26743510   2.806561e-05 -2.643402 -3.215280 -0.571878063
FP019 -2.31045847   2.207143e-02 -2.766910 -2.370603  0.396306731
FP020 -3.44119896   7.251032e-04 -2.785806 -2.224912  0.560894171
FP021  3.35165112   1.009498e-03 -2.642392 -3.272348 -0.629955482
FP022 -0.66772403   5.051252e-01 -2.728040 -2.637071  0.090969199
FP023  2.18958532   2.989162e-02 -2.673106 -3.042650 -0.369544057
FP024 -2.43189276   1.617811e-02 -2.766457 -2.340841  0.425616224
FP025 -2.68651403   7.981132e-03 -2.771677 -2.312545  0.459131121
FP026  0.58596455   5.591541e-01 -2.709082 -2.821875 -0.112793485
FP027 -4.46177875   1.714807e-05 -2.793800 -2.024516  0.769283405
FP028 -3.36478123   1.011310e-03 -2.791941 -2.101089  0.690852068
FP029  1.50309317   1.346711e-01 -2.696475 -2.913093 -0.216617374
FP030 -4.18564626   5.684141e-05 -2.799582 -1.933933  0.865649782
FP031 -0.19030898   8.494207e-01 -2.721986 -2.683765  0.038221437
FP032 -2.86824205   5.100440e-03 -2.757832 -2.224429  0.533403438
FP033 -2.48343886   1.492327e-02 -2.751062 -2.282879  0.468183359
FP034  0.81786492   4.147985e-01 -2.709737 -2.820263 -0.110526015
FP035  4.17698556   6.851675e-05 -2.659660 -3.471594 -0.811934339
FP036 -5.31186085   6.344823e-07 -2.787224 -1.880417  0.906807452
FP037  1.37213471   1.734895e-01 -2.700271 -2.960000 -0.259728507
FP038 -2.55044552   1.224045e-02 -2.764833 -2.228293  0.536540459
FP039  6.83856010   1.396591e-09 -2.588330 -4.332817 -1.744487356
FP040 -4.96957478   3.640553e-06 -2.788036 -1.771692  1.016343810
FP041  3.86443922   2.274448e-04 -2.672424 -3.403833 -0.731409091
FP042 -1.10149897   2.742144e-01 -2.729509 -2.536852  0.192657624
FP043 -0.18525729   8.535189e-01 -2.721284 -2.680317  0.040966323
FP044 15.19844350   1.458342e-22 -2.472237 -6.582105 -4.109868127
FP045  3.26197779   1.781037e-03 -2.678118 -3.403962 -0.725844224
FP046  7.19096539   1.949765e-12 -2.405146 -3.398700 -0.993554071
FP047  3.08813847   2.106659e-03 -2.611605 -3.013676 -0.402071305
FP048  0.78156187   4.354510e-01 -2.703337 -2.826102 -0.122764360
FP049  9.32620107   1.541509e-16 -2.494036 -4.334828 -1.840791658
FP050  1.78989997   7.537562e-02 -2.684810 -2.984860 -0.300049387
FP051  3.85923300   1.590148e-04 -2.656482 -3.224231 -0.567749069
FP052 -1.37622794   1.707261e-01 -2.736296 -2.542529  0.193767561
FP053  7.79872544   3.863769e-12 -2.565418 -4.201910 -1.636492479
FP054  4.71268264   7.815108e-06 -2.656678 -3.474167 -0.817488623
FP055 -2.15047129   3.539774e-02 -2.743122 -2.285294  0.457828105
FP056  6.56517336   8.289424e-09 -2.598841 -4.435323 -1.836481186
FP057  1.55970276   1.207241e-01 -2.686667 -2.952807 -0.266140351
FP058  1.31266618   1.913070e-01 -2.691483 -2.930000 -0.238517200
FP059  5.30327181   1.388228e-06 -2.662258 -3.692115 -1.029857320
FP060 -6.34967826   3.396521e-10 -3.112819 -2.294192  0.818627333
FP061 -3.23528852   1.258017e-03 -2.903859 -2.489247  0.414612257
FP062 -4.68040368   3.284921e-06 -2.978056 -2.384856  0.593200306
FP063 -5.90647947   4.865776e-09 -3.037509 -2.288593  0.748916565
FP064 -3.19849081   1.427257e-03 -2.887640 -2.481616  0.406023478
FP065 13.67947483   7.369864e-39 -1.740827 -3.389468 -1.648641212
FP066 -3.50425986   4.936856e-04 -3.034043 -2.516776  0.517267265
FP067 -3.71025855   2.192910e-04 -2.894797 -2.430554  0.464242594
FP068 -4.50468714   7.534223e-06 -2.923921 -2.356221  0.567699992
FP069 -1.39582672   1.631126e-01 -2.782438 -2.605872  0.176566128
FP070 11.33500604   6.532630e-27 -2.155840 -3.739142 -1.583301881
FP071  9.16039412   1.012284e-18 -2.295828 -3.588521 -1.292692775
FP072 -9.86673490   4.502526e-21 -3.674277 -2.222396  1.451880757
FP073 -6.31556184   4.773987e-10 -2.972104 -2.154780  0.817323998
FP074 -3.16365915   1.617158e-03 -2.849299 -2.446958  0.402341137
FP075 -4.83159241   1.618286e-06 -2.926916 -2.311584  0.615331888
FP076 18.19671006   2.170836e-57 -1.949953 -4.292756 -2.342803359
FP077 -0.24434665   8.070283e-01 -2.728715 -2.697082  0.031633203
FP078 -0.49694487   6.193690e-01 -2.737523 -2.675156  0.062366949
FP079 12.46647477   2.609452e-32 -1.649763 -3.199207 -1.549444605
FP080 -4.44534892   1.029202e-05 -2.896848 -2.308160  0.588687940
FP081  0.11125946   9.114457e-01 -2.714519 -2.729057 -0.014537653
FP082 12.55490234   3.329065e-32 -1.573824 -3.177143 -1.603319328
FP083 -6.28835488   5.760827e-10 -2.932735 -2.149385  0.783350551
FP084 -3.43524930   6.332047e-04 -2.851414 -2.386949  0.464465314
FP085 10.47209331   1.134762e-22 -2.307585 -3.916008 -1.608423485
FP086  1.02088695   3.077271e-01 -2.682101 -2.817578 -0.135477406
FP087 11.07193302   5.850147e-26 -1.684808 -3.107540 -1.422732105
FP088 -4.82078133   1.873320e-06 -2.891398 -2.233960  0.657438003
FP089 15.68684642   7.559612e-42 -2.131606 -4.506936 -2.375330025
FP090  0.72850761   4.666345e-01 -2.693950 -2.792743 -0.098793036
FP091 -1.97821299   4.847758e-02 -2.777626 -2.515187  0.262438593
FP092 12.71461669   9.160201e-31 -2.250250 -4.169957 -1.919706549
FP093  2.40580805   1.652056e-02 -2.636787 -2.972026 -0.335238658
FP094 -1.08529331   2.783195e-01 -2.751874 -2.607909  0.143965054
FP095 -4.83150303   1.885749e-06 -2.863571 -2.203780  0.659791524
FP096 -0.05816460   9.536450e-01 -2.720323 -2.712271  0.008052049
FP097  9.06740092   4.508890e-18 -2.420977 -3.684420 -1.263443027
FP098 -3.09495737   2.088014e-03 -2.820538 -2.391460  0.429077754
FP099  4.51553294   8.153915e-06 -2.575959 -3.203843 -0.627883409
FP100 -4.26730797   2.354655e-05 -2.846430 -2.293727  0.552702276
FP101 -3.33565277   9.211008e-04 -2.828760 -2.363022  0.465738108
FP102  1.25032500   2.119440e-01 -2.683373 -2.857708 -0.174335474
FP103  2.51185846   1.236590e-02 -2.644038 -2.984808 -0.340770007
FP104  1.23433987   2.176989e-01 -2.681746 -2.846934 -0.165188360
FP105  2.56644125   1.063908e-02 -2.640201 -3.003756 -0.363555025
FP106  2.42187970   1.595574e-02 -2.652367 -2.998297 -0.345929993
FP107 10.92623859   2.395320e-23 -2.328707 -4.173284 -1.844576915
FP108 -0.88386799   3.773218e-01 -2.744087 -2.619641  0.124446276
FP109  1.72666429   8.493856e-02 -2.681392 -2.891845 -0.210453156
FP110 -4.30633122   2.083157e-05 -2.839272 -2.253622  0.585649074
FP111  0.07891212   9.371465e-01 -2.716361 -2.727594 -0.011232326
FP112 13.31169435   4.090297e-31 -2.293512 -4.478541 -2.185028791
FP113 -4.25438885   2.743420e-05 -2.842824 -2.207527  0.635296648
FP114  0.38442341   7.009005e-01 -2.711034 -2.759459 -0.048425836
FP115 -0.49398272   6.216320e-01 -2.730653 -2.663059  0.067594185
FP116 -3.39726200   7.657795e-04 -2.815911 -2.310055  0.505856814
FP117  3.16005628   1.769096e-03 -2.623060 -3.157353 -0.534292762
FP118 -3.88255786   1.272871e-04 -2.835755 -2.226776  0.608979252
FP119 -0.71996857   4.720764e-01 -2.734485 -2.636839  0.097646215
FP120 -3.25854728   1.280523e-03 -2.807793 -2.270759  0.537033697
FP121  0.62156119   5.349141e-01 -2.704487 -2.805188 -0.100701417
FP122 -2.44169102   1.530759e-02 -2.781836 -2.396154  0.385682632
FP123  3.52755166   4.929055e-04 -2.628914 -3.165157 -0.536243091
FP124 -3.58983366   3.953044e-04 -2.806888 -2.261494  0.545394825
FP125 -2.91655379   3.853055e-03 -2.786364 -2.350743  0.435620393
FP126 -1.44180023   1.505173e-01 -2.748395 -2.547234  0.201161019
FP127 -2.66597987   8.213408e-03 -2.773386 -2.381429  0.391957737
FP128 -3.37747584   8.536233e-04 -2.794086 -2.284752  0.509334647
FP129  3.28855844   1.192299e-03 -2.642100 -3.193030 -0.550930181
FP130  1.02990587   3.048783e-01 -2.698555 -2.888900 -0.190345358
FP131 -0.49682548   6.198471e-01 -2.727954 -2.653583  0.074370939
FP132 -5.89680424   1.633112e-08 -2.832055 -1.925126  0.906929238
FP133 -1.83896087   6.756107e-02 -2.757100 -2.451750  0.305349880
FP134  3.16620016   1.761695e-03 -2.661506 -3.110000 -0.448493976
FP135 -2.94236705   3.709259e-03 -2.783827 -2.266667  0.517160048
FP136 -2.02006233   4.501990e-02 -2.761938 -2.403304  0.358633451
FP137 -0.07855180   9.374873e-01 -2.720131 -2.706636  0.013494433
FP138 -1.44829927   1.496787e-01 -2.748083 -2.483302  0.264780953
FP139 -0.22212826   8.246439e-01 -2.721936 -2.680897  0.041038417
FP140 -1.86990507   6.355486e-02 -2.758036 -2.403962  0.354073239
FP141  4.15441700   4.792655e-05 -2.650655 -3.232523 -0.581867761
FP142 -2.92307611   4.047862e-03 -2.779233 -2.224519  0.554713355
FP143  0.83414756   4.061300e-01 -2.705904 -2.862338 -0.156433772
FP144 -4.98991305   1.904653e-06 -2.819214 -1.852424  0.966789373
FP145 -3.99831545   1.002597e-04 -2.787077 -2.128990  0.658087566
FP146  6.08904552   1.064009e-08 -2.608687 -3.675000 -1.066313013
FP147 -2.98364059   3.376138e-03 -2.776357 -2.226800  0.549557227
FP148 -4.00444775   1.101041e-04 -2.780300 -2.073012  0.707287491
FP149  9.67498002   8.530838e-16 -2.479225 -5.125930 -2.646704799
FP150 -1.59224059   1.145443e-01 -2.742808 -2.435467  0.307341553
FP151 -1.68674372   9.608846e-02 -2.736013 -2.423019  0.312994495
FP152  2.02103329   4.549820e-02 -2.692325 -3.012308 -0.319982377
FP153  0.83775227   4.044086e-01 -2.703900 -2.892432 -0.188532775
FP154 -0.18701160   8.526043e-01 -2.720525 -2.668889  0.051635701
FP155  4.93743429   3.813516e-06 -2.653412 -3.592273 -0.938860298
FP156  2.70254904   8.178498e-03 -2.685045 -3.160896 -0.475850274
FP157 -1.19798365   2.351567e-01 -2.738105 -2.423220  0.314885042
FP158 -3.18371959   2.293303e-03 -2.757078 -2.039020  0.718058170
FP159  2.90626659   4.444806e-03 -2.687590 -3.127313 -0.439722935
FP160  0.72930617   4.673596e-01 -2.711400 -2.816308 -0.104908144
FP161 -8.02084404   8.158474e-12 -2.826779 -1.193333  1.633445946
FP162  9.05654884   7.502729e-19 -2.147208 -3.300849 -1.153640924
FP163 -4.73411111   2.565152e-06 -3.009759 -2.398455  0.611304290
FP164 11.15556043   6.131703e-27 -1.830706 -3.245042 -1.414335661
FP165 -3.26163144   1.150990e-03 -2.862294 -2.450602  0.411691613
FP166  6.01599552   3.059094e-09 -2.441541 -3.277905 -0.836363881
FP167 -3.77468033   1.718080e-04 -2.874742 -2.398718  0.476023835
FP168 12.78784085   6.302482e-34 -1.659686 -3.250521 -1.590835792
FP169 10.79840624   1.952902e-22 -2.370413 -4.241017 -1.870603512
FP170  1.45059296   1.480425e-01 -2.674961 -2.911943 -0.236981517
FP171 -3.56151646   4.354270e-04 -2.810722 -2.266398  0.544324003
FP172 13.04070659   8.112523e-28 -2.345390 -4.809931 -2.464540221
FP173  2.68918003   7.770466e-03 -2.653554 -3.111556 -0.458001634
FP174  0.94721964   3.446525e-01 -2.699492 -2.845806 -0.146314311
FP175  0.01020115   9.918704e-01 -2.718360 -2.719922 -0.001562215
FP176 -2.29447613   2.298911e-02 -2.766395 -2.374310  0.392084865
FP177 -1.08253877   2.802959e-01 -2.737548 -2.580609  0.156939151
FP178  3.27582610   1.258481e-03 -2.656782 -3.167739 -0.510956834
FP179  0.85670987   3.931634e-01 -2.703846 -2.854409 -0.150562448
FP180 -2.83913345   5.188161e-03 -2.773274 -2.263235  0.510039146
FP181  6.24259165   6.005980e-09 -2.617726 -3.695281 -1.077554681
FP182 -2.11887211   3.595632e-02 -2.755239 -2.384255  0.370983887
FP183 -2.62186301   1.015591e-02 -2.755210 -2.271250  0.483960466
FP184 10.24979020   9.572172e-17 -2.493318 -5.171000 -2.677681975
FP185  3.21519455   1.718715e-03 -2.667230 -3.270000 -0.602770115
FP186 -2.10893733   3.756740e-02 -2.749818 -2.342740  0.407078042
FP187 -0.14233858   8.871705e-01 -2.721122 -2.685942  0.035180420
FP188 -2.76497219   7.083803e-03 -2.760011 -2.153692  0.606318979
FP189  0.29230393   7.707177e-01 -2.713884 -2.774932 -0.061047680
FP190  8.23796541   2.799252e-12 -2.574785 -4.556522 -1.981737159
FP191 -1.62000293   1.089976e-01 -2.742364 -2.404627  0.337737388
FP192  0.55100083   5.833593e-01 -2.711377 -2.829310 -0.117932965
FP193 11.06173597   1.595927e-16 -2.525146 -5.642881 -3.117735616
FP194 -1.03294441   3.047671e-01 -2.728916 -2.553214  0.175701915
FP195 -5.88072667   1.035398e-07 -2.786495 -1.672759  1.113736340
FP196  6.42707826   1.269199e-08 -2.651126 -3.838889 -1.187762913
FP197  3.82944792   3.167065e-04 -2.670555 -3.583800 -0.913245061
FP198 -3.87872401   2.598433e-04 -2.776165 -1.761852  1.014313143
FP199  0.59118217   5.569865e-01 -2.711578 -2.859333 -0.147754967
FP200  5.15622561   3.020793e-06 -2.668319 -3.685106 -1.016787799
FP201 -3.92629512   2.100852e-04 -2.757414 -2.018600  0.738813984
FP202  5.92935333   6.082278e-09 -2.496969 -3.357143 -0.860174019
FP203  1.09341446   2.759667e-01 -2.695582 -2.896147 -0.200564841
FP204  2.86078975   4.868444e-03 -2.672159 -3.141702 -0.469543435
FP205  5.61427744   2.488511e-07 -2.605564 -4.057838 -1.452273414
FP206  3.58353985   6.162975e-04 -2.674519 -3.409474 -0.734954669
FP207  8.34894566   1.153650e-11 -2.595151 -4.768704 -2.173553202
FP208  1.37823055   1.702203e-01 -2.690237 -2.942056 -0.251819108
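The `t.Statistic` column above is the Welch (unequal variance) two-sample statistic that `t.test()` computes by default. As a sanity check, it can be reproduced from first principles for one fingerprint (this assumes the `SolCatPred` data frame from the cell above is still in scope):

```r
## Welch t-statistic for FP001, computed by hand
x <- SolCatPred$FP001
y <- solTrainY
m0 <- mean(y[x == 0]); m1 <- mean(y[x == 1])
v0 <- var(y[x == 0]);  v1 <- var(y[x == 1])
n0 <- sum(x == 0);     n1 <- sum(x == 1)
tt <- (m0 - m1) / sqrt(v0 / n0 + v1 / n1)
all.equal(unname(t.test(y ~ x)$statistic), tt)   # TRUE
```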
In [131]:
%%R

## Create a volcano plot
print(
xyplot(-log10(t.test_p.value) ~ difference,
       data = tests,
       xlab = "Mean With Structure - Mean Without Structure",
       ylab = "-log10(p-Value)",
       type = "p")
)
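Points in the upper-left and upper-right corners of the volcano plot are the most interesting: large shifts in mean solubility combined with very small p-values. One way to list those candidates directly (the cutoffs here are arbitrary, chosen only for illustration):

```r
## Fingerprints with a large mean shift and a very small p-value
subset(tests, abs(difference) > 2 & t.test_p.value < 1e-10)
```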
In [97]:
%%R

### Section 18.2 Categorical Outcomes

## Load the segmentation data

data(segmentationData)
segTrain <- subset(segmentationData, Case == "Train")
segTrain$Case <- segTrain$Cell <- NULL

segTest <- subset(segmentationData, Case != "Train")
segTest$Case <- segTest$Cell <- NULL

## Compute the areas under the ROC curve
aucVals <- filterVarImp(x = segTrain[, -1], y = segTrain$Class)
aucVals$Predictor <- rownames(aucVals)

## Calculate the t-tests as before but with x and y switched
segTests <- apply(segTrain[, -1], 2,
                  function(x, y)
                    {
                    tStats <- t.test(x ~ y)[c("statistic", "p.value", "estimate")]
                    unlist(tStats)
                    },
               y = segTrain$Class)
segTests <- as.data.frame(t(segTests))
names(segTests) <- c("t.Statistic", "t.test_p.value", "mean0", "mean1")
segTests$Predictor <- rownames(segTests)

## Fit a random forest model and get the importance scores
library(randomForest)
set.seed(791)
rfImp <- randomForest(Class ~ ., data = segTrain,
                      ntree = 2000,
                      importance = TRUE)
rfValues <- data.frame(RF = importance(rfImp)[, "MeanDecreaseGini"],
                       Predictor = rownames(importance(rfImp)))

## Now compute the Relief scores (attrEval() is from the CORElearn package)
library(CORElearn)
set.seed(791)

ReliefValues <- attrEval(Class ~ ., data = segTrain,
                         estimator = "ReliefFequalK", ReliefIterations = 50)
ReliefValues <- data.frame(Relief = ReliefValues,
                           Predictor = names(ReliefValues))

## and the MIC statistics (mine() is from the minerva package)
library(minerva)
set.seed(791)
segMIC <- mine(x = segTrain[, -1],
               ## Pass the outcome as 0/1
               y = ifelse(segTrain$Class == "PS", 1, 0))$MIC
segMIC <- data.frame(Predictor = rownames(segMIC),
                     MIC = segMIC[, 1])


rankings <- merge(segMIC, ReliefValues)
rankings <- merge(rankings, rfValues)
rankings <- merge(rankings, segTests)
rankings <- merge(rankings, aucVals)
print(
rankings
)
                 Predictor         MIC       Relief        RF  t.Statistic
1                 AngleCh1 0.131057008  0.002287557  4.730963  -0.21869850
2                  AreaCh1 0.108083908  0.016041257  4.315317  -0.93160658
3              AvgIntenCh1 0.292046076  0.071057681 18.865802 -11.75400848
4              AvgIntenCh2 0.329484594  0.150684824 21.857848 -16.09400822
5              AvgIntenCh3 0.135443794  0.018172519  5.135363  -0.14752973
6              AvgIntenCh4 0.166545039 -0.007167866  5.434737  -6.23725001
7   ConvexHullAreaRatioCh1 0.299627157  0.035983697 19.093048  14.22756193
8  ConvexHullPerimRatioCh1 0.254931744  0.041865999 12.624038 -13.86697029
9      DiffIntenDensityCh1 0.239224382  0.038582763  7.335741  -9.81721615
10     DiffIntenDensityCh3 0.133084659  0.010830941  6.647198   1.48785690
11     DiffIntenDensityCh4 0.147643832  0.042352546  5.386981  -5.54840221
12         EntropyIntenCh1 0.261097110  0.129280729 13.867582 -14.04326173
13         EntropyIntenCh3 0.172122729  0.039687246  5.127465   6.94689541
14         EntropyIntenCh4 0.185625627  0.021260676  5.742739  -9.03621024
15           EqCircDiamCh1 0.108083908  0.038820971  4.185607  -1.85186912
16         EqEllipseLWRCh1 0.212579943  0.016550609  5.708705   9.83868863
17   EqEllipseOblateVolCh1 0.122276159  0.010367074  3.906543   1.35616134
18  EqEllipseProlateVolCh1 0.169674904 -0.005386670  6.018121  -1.29243801
19         EqSphereAreaCh1 0.108083908  0.016110539  4.183567  -0.93273061
20          EqSphereVolCh1 0.108083908  0.003440003  4.133475  -0.04348657
21          FiberAlign2Ch3 0.177116842 -0.002628403  4.373886   3.65095007
22          FiberAlign2Ch4 0.149937844  0.016047962  4.868552   2.07009183
23          FiberLengthCh1 0.220505513  0.050610471  8.368712   9.26429955
24           FiberWidthCh1 0.368720274  0.107691201 33.371913 -18.96852051
25         IntenCoocASMCh3 0.196466490  0.024738010  7.298595  -7.95107008
26         IntenCoocASMCh4 0.147981004  0.005574684  3.734085   4.51016239
27    IntenCoocContrastCh3 0.231500707  0.021282305  8.438533  13.20540372
28    IntenCoocContrastCh4 0.135150335 -0.002605380  4.567712   1.02551789
29     IntenCoocEntropyCh3 0.202905819  0.039769279  6.354566   9.62738946
30     IntenCoocEntropyCh4 0.148928924  0.042214966  4.234247  -5.73801017
31         IntenCoocMaxCh3 0.193078547  0.039834486  6.865277 -10.01109754
32         IntenCoocMaxCh4 0.152580596  0.064488810  3.966995   5.02868895
33            KurtIntenCh1 0.200874103  0.003243188  7.095402   3.18226166
34            KurtIntenCh3 0.135694293  0.010944913  4.237905  -2.46783420
35            KurtIntenCh4 0.152775633  0.011328311  5.339427   4.39807449
36               LengthCh1 0.149378763  0.044483732  4.235474   5.28480181
37      NeighborAvgDistCh1 0.123412342  0.023330722  4.266566  -0.46614250
38      NeighborMinDistCh1 0.125623472  0.007850922  5.152365   0.80769702
39      NeighborVarDistCh1 0.124259322  0.016447793  4.286239   0.29886752
40                PerimCh1 0.170013515  0.025272254  4.115593   6.18542523
41             ShapeBFRCh1 0.235667275  0.005194794  9.782458 -13.25311412
42             ShapeLWRCh1 0.183599199  0.029568271  4.745873   8.40241429
43             ShapeP2ACh1 0.332238080  0.073795605 19.362332  14.75801555
44            SkewIntenCh1 0.259680600  0.085229983 13.628434   9.66411304
45            SkewIntenCh3 0.149153858  0.056669970  4.244103  -3.76453794
46            SkewIntenCh4 0.152202895  0.002508761  5.478398   6.46619794
47       SpotFiberCountCh3 0.005721744 -0.005692308  1.793200  -0.53238018
48       SpotFiberCountCh4 0.019496167 -0.015192982  2.948225   2.98634139
49           TotalIntenCh1 0.304429766  0.045548534 20.916993  -8.20041297
50           TotalIntenCh2 0.400952572  0.185416030 41.617068 -14.54087193
51           TotalIntenCh3 0.115771733  0.015068883  5.402005  -0.46828755
52           TotalIntenCh4 0.186643156  0.006071748  5.712561  -5.64791505
53             VarIntenCh1 0.241235863  0.045687478  9.259561 -10.40110966
54             VarIntenCh3 0.150238051  0.002815999  5.176123  -2.44172596
55             VarIntenCh4 0.171222193  0.001547820  5.981325  -4.83455579
56                WidthCh1 0.146204548  0.021560423  5.113884  -1.59227638
57               XCentroid 0.106662637 -0.037877551  4.220162   1.10633278
58               YCentroid 0.119516938  0.055209622  4.908536   2.19081435
   t.test_p.value        mean0        mean1        PS        WS
1    8.269443e-01 9.086539e+01 9.157148e+01 0.5025967 0.5025967
2    3.517830e-01 3.205519e+02 3.329249e+02 0.5709170 0.5709170
3    4.819837e-28 7.702212e+01 2.146922e+02 0.7662375 0.7662375
4    2.530403e-50 1.324405e+02 2.778397e+02 0.7866146 0.7866146
5    8.827553e-01 9.578766e+01 9.671147e+01 0.5214098 0.5214098
6    7.976250e-10 1.168287e+02 1.795797e+02 0.6473814 0.6473814
7    5.895088e-42 1.270408e+00 1.114054e+00 0.7815519 0.7815519
8    4.644231e-40 8.714806e-01 9.310403e-01 0.7547844 0.7547844
9    6.509740e-21 6.055821e+01 9.601373e+01 0.7161591 0.7161591
10   1.371842e-01 7.753072e+01 7.104993e+01 0.5427353 0.5427353
11   4.178896e-08 7.508542e+01 1.061125e+02 0.6294704 0.6294704
12   5.145995e-40 6.364841e+00 7.004622e+00 0.7565169 0.7565169
13   8.836060e-12 5.704662e+00 5.014508e+00 0.6340145 0.6340145
14   9.775620e-19 5.192365e+00 6.023039e+00 0.6661861 0.6661861
15   6.437960e-02 1.940093e+01 2.002646e+01 0.5709170 0.5709170
16   7.218411e-22 2.371177e+00 1.758240e+00 0.6965915 0.6965915
17   1.753561e-01 7.632288e+02 6.866693e+02 0.5045568 0.5045568
18   1.965213e-01 3.543481e+02 3.920429e+02 0.6301870 0.6301870
19   3.512025e-01 1.284179e+03 1.333731e+03 0.5709170 0.5709170
20   9.653226e-01 5.017110e+03 5.033648e+03 0.5709170 0.5709170
21   2.770065e-04 1.479185e+00 1.421565e+00 0.5690728 0.5690728
22   3.873106e-02 1.444148e+00 1.412867e+00 0.5421535 0.5421535
23   1.239044e-19 3.991835e+01 2.819142e+01 0.7007984 0.7007984
24   1.162284e-64 8.691444e+00 1.282684e+01 0.8355127 0.8355127
25   1.067683e-14 7.373161e-02 1.559897e-01 0.6956093 0.6956093
26   7.290850e-06 1.131789e-01 7.724074e-02 0.5878438 0.5878438
27   7.794899e-37 1.163875e+01 6.292079e+00 0.7214199 0.7214199
28   3.053656e-01 7.700191e+00 7.343397e+00 0.5358642 0.5358642
29   1.282007e-20 6.201308e+00 5.216667e+00 0.6891345 0.6891345
30   1.313352e-08 5.545934e+00 6.032306e+00 0.6073356 0.6073356
31   4.418432e-22 1.900393e-01 3.245564e-01 0.6944627 0.6944627
32   5.990072e-07 2.707207e-01 2.131262e-01 0.5892938 0.5892938
33   1.506054e-03 1.208829e+00 3.868323e-01 0.6711982 0.6711982
34   1.388162e-02 3.121647e+00 4.480168e+00 0.5513936 0.5513936
35   1.210957e-05 1.388322e+00 2.421078e-01 0.6046335 0.6046335
36   1.571520e-07 3.237304e+01 2.839838e+01 0.6015142 0.6015142
37   6.412508e-01 2.294382e+02 2.307292e+02 0.5047676 0.5047676
38   4.194740e-01 3.020875e+01 2.962558e+01 0.5018274 0.5018274
39   7.651196e-01 1.046047e+02 1.042038e+02 0.5072546 0.5072546
40   9.075622e-10 9.721959e+01 8.203652e+01 0.6200196 0.6200196
41   6.819382e-37 5.630603e-01 6.406694e-01 0.7319836 0.7319836
42   1.498789e-16 1.968091e+00 1.601640e+00 0.6607778 0.6607778
43   9.265729e-45 2.380621e+00 1.606325e+00 0.7930978 0.7930978
44   6.631564e-21 8.687084e-01 4.124373e-01 0.7253275 0.7253275
45   1.819323e-04 1.429871e+00 1.711829e+00 0.5732881 0.5732881
46   1.592246e-10 1.069003e+00 7.366442e-01 0.6193873 0.6193873
47   5.946089e-01 1.915094e+00 1.970509e+00 0.5173630 0.5173630
48   2.894728e-03 7.224843e+00 6.477212e+00 0.4619775 0.4619775
49   1.624963e-15 2.494150e+04 6.265354e+04 0.7895358 0.7895358
50   3.385024e-43 3.858694e+04 7.665351e+04 0.8012840 0.8012840
51   6.397155e-01 2.685926e+04 2.770986e+04 0.5094972 0.5094972
52   2.290183e-08 3.466429e+04 5.217025e+04 0.6599073 0.6599073
53   5.662429e-23 5.142099e+01 1.136596e+02 0.7322365 0.7322365
54   1.488950e-02 9.519852e+01 1.127093e+02 0.5330821 0.5330821
55   1.632212e-06 1.063653e+02 1.430475e+02 0.6322357 0.6322357
56   1.116486e-01 1.754162e+01 1.813792e+01 0.5799484 0.5799484
57   2.689098e-01 2.698852e+02 2.599759e+02 0.5216669 0.5216669
58   2.875168e-02 1.842972e+02 1.691475e+02 0.5407878 0.5407878
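For a two-class outcome, the per-predictor area under the ROC curve that filterVarImp() reports coincides with the normalized Mann-Whitney U statistic, so it can be computed from ranks alone. A self-contained sketch on simulated data (all names here are illustrative, not the segmentation variables):

```r
## AUC of a single predictor equals the Mann-Whitney U statistic divided by
## n0 * n1 -- a standard identity, not caret-specific code
set.seed(3)
x <- c(rnorm(50, mean = 0), rnorm(50, mean = 1))  # predictor, shifted by class
y <- rep(c(0, 1), each = 50)                      # two-class outcome
r <- rank(x)
auc <- (sum(r[y == 1]) - 50 * 51 / 2) / (50 * 50)
auc  # near pnorm(1 / sqrt(2)), about 0.76, for a one-SD shift
```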
In [96]:
%%R

rankings$channel <- "Channel 1"
rankings$channel[grepl("Ch2$", rankings$Predictor)] <- "Channel 2"
rankings$channel[grepl("Ch3$", rankings$Predictor)] <- "Channel 3"
rankings$channel[grepl("Ch4$", rankings$Predictor)] <- "Channel 4"
rankings$t.Statistic <- abs(rankings$t.Statistic)

print(
splom(~rankings[, c("PS", "t.Statistic", "RF", "Relief", "MIC")],
      groups = rankings$channel,
      varnames = c("ROC\nAUC", "Abs\nt-Stat", "Random\nForest", "Relief", "MIC"),
      auto.key = list(columns = 2))
)
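Because the five importance measures in the scatterplot matrix live on very different scales, agreement between them is easier to judge through their rankings than their raw values. A generic rank-correlation sketch (simulated scores, not the segmentation results):

```r
## Spearman correlation compares the orderings of two scores, ignoring scale
set.seed(5)
score1 <- runif(30)
score2 <- score1 + rnorm(30, sd = 0.05)  # a second, mostly agreeing ranking
cor(score1, score2, method = "spearman")
```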
In [100]:
%%R

## Load the grant data. A script to create and save these data is contained
## in the same directory as this file.

source(file.path(scriptLocation(), "CreateGrantData.R"), echo = TRUE)

load("grantData.RData")

dataSubset <- training[pre2008, c("Sponsor62B", "ContractValueBandUnk", "RFCD240302")]

## This is a simple function to compute several statistics for binary predictors
tableCalcs <- function(x, y)
  {
  tab <- table(x, y)
  fet <- fisher.test(tab)
  out <- c(OR = fet$estimate,
           P = fet$p.value,
           ## attrEval() is from the CORElearn package
           Gain = attrEval(y ~ x, estimator = "GainRatio"))
  out
  }

## lapply() is used to execute the function on each column
tableResults <- lapply(dataSubset, tableCalcs, y = training[pre2008, "Class"])

## The results come back as a list of vectors, and rbind() is used to join
## them together as rows of a table
tableResults <- do.call("rbind", tableResults)
print(
    tableResults
)

## The permuted Relief scores can be computed using a function from the
## AppliedPredictiveModeling package.

permuted <- permuteRelief(x = training[pre2008, c("Sponsor62B", "Day", "NumCI")],
                          y = training[pre2008, "Class"],
                          nperm = 500,
                          ## the remaining options are passed to attrEval()
                          estimator = "ReliefFequalK",
                          ReliefIterations = 50)

## The original Relief scores:
print(
permuted$observed
)

## The number of standard deviations away from the permuted mean:
print(
permuted$standardized
)

## The distributions of the scores if there were no relationship between the
## predictors and outcomes

print(
histogram(~value|Predictor,
          data = permuted$permutations,
          xlim = extendrange(permuted$permutations$value),
          xlab = "Relief Score")
)
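The permutation logic behind permuteRelief() can be sketched generically, using correlation as a stand-in for the Relief score (simulated data; this is not the AppliedPredictiveModeling implementation):

```r
## Score a predictor, then standardize the observed score against the null
## distribution obtained by permuting the outcome -- the idea behind the
## 'standardized' component above
set.seed(1)
x <- rnorm(100)
y <- x + rnorm(100)                         # x genuinely relates to y
observed <- cor(x, y)
perms <- replicate(500, cor(x, sample(y)))  # scores under no relationship
standardized <- (observed - mean(perms)) / sd(perms)
standardized                                # many SDs above the permuted mean
```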
> ################################################################################
> ### R code from Applied Predictive Modeling (2013) by Kuhn and Jo .... [TRUNCATED] 

> library(caret)

> library(lubridate)

Attaching package: 'lubridate'

The following object is masked from 'package:plyr':

    here


> ## How many cores on the machine should be used for the data
> ## processing. Making cores > 1 will speed things up (depending on your
> ## machine) .... [TRUNCATED] 
Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") :
  cannot open file 'unimelb_training.csv': No such file or directory
Error in file(file, "rt") : cannot open the connection
In [98]:
%%R

showChapterScript(19)
NULL
In [88]:
%%R

showChapterOutput(19)
R Information
R version 3.0.1 (2013-05-16) -- "Good Sport"
Copyright (C) 2013 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin10.8.0 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> ################################################################################
> ### R code from Applied Predictive Modeling (2013) by Kuhn and Johnson.
> ### Copyright 2013 Kuhn and Johnson
> ### Web Page: http://www.appliedpredictivemodeling.com
> ### Contact: Max Kuhn (mxkuhn@gmail.com) 
> ###
> ### Chapter 19: An Introduction to Feature Selection
> ###
> ### Required packages: AppliedPredictiveModeling, caret, MASS, corrplot,
> ###                    RColorBrewer, randomForest, kernlab, klaR,
> ###                   
> ###
> ### Data used: The Alzheimer disease data from the AppliedPredictiveModeling 
> ###            package
> ###
> ### Notes: 
> ### 1) This code is provided without warranty.
> ###
> ### 2) This code should help the user reproduce the results in the
> ### text. There will be differences between this code and what is in
> ### the computing section. For example, the computing sections show
> ### how the source functions work (e.g. randomForest() or plsr()),
> ### which were not directly used when creating the book. Also, there may be 
> ### syntax differences that occur over time as packages evolve. These files 
> ### will reflect those changes.
> ###
> ### 3) In some cases, the calculations in the book were run in 
> ### parallel. The sub-processes may reset the random number seed.
> ### Your results may slightly vary.
> ###
> ################################################################################
> 
> 
> 
> ################################################################################
> ### Section 19.6 Case Study: Predicting Cognitive Impairment
> 
> 
> library(AppliedPredictiveModeling)
> data(AlzheimerDisease)
> 
> ## The baseline set of predictors
> bl <- c("Genotype", "age", "tau", "p_tau", "Ab_42", "male")
> 
> ## The set of new assays
> newAssays <- colnames(predictors)
> newAssays <- newAssays[!(newAssays %in% c("Class", bl))]
> 
> ## Decompose the genotype factor into binary dummy variables
> 
> predictors$E2 <- predictors$E3 <- predictors$E4 <- 0
> predictors$E2[grepl("2", predictors$Genotype)] <- 1
> predictors$E3[grepl("3", predictors$Genotype)] <- 1
> predictors$E4[grepl("4", predictors$Genotype)] <- 1
> genotype <-  predictors$Genotype
> 
> ## Partition the data
> library(caret)
Loading required package: lattice
Loading required package: ggplot2
> set.seed(730)
> split <- createDataPartition(diagnosis, p = .8, list = FALSE)
> 
> adData <- predictors
> adData$Class <- diagnosis
> 
> training <- adData[ split, ]
> testing  <- adData[-split, ]
> 
> predVars <- names(adData)[!(names(adData) %in% c("Class",  "Genotype"))]
> 
> ## This summary function is used to evaluate the models.
> fiveStats <- function(...) c(twoClassSummary(...), defaultSummary(...))
> 
> ## We create the cross-validation files as a list to use with different 
> ## functions
> 
> set.seed(104)
> index <- createMultiFolds(training$Class, times = 5)
> 
> ## The candidate set of the number of predictors to evaluate
> varSeq <- seq(1, length(predVars)-1, by = 2)
> 
> ## We can also use parallel processing to run each resampled RFE
> ## iteration (or resampled model with train()) using different
> ## workers.
> 
> library(doMC)
Loading required package: foreach
Loading required package: iterators
Loading required package: parallel
> registerDoMC(15)
> 
> 
> ## The rfe() function in the caret package is used for recursive feature 
> ## elimination. We set up control functions for this and train() that use
> ## the same cross-validation folds. The 'ctrl' object will be modified several
> ## times as we try different models
> 
> ctrl <- rfeControl(method = "repeatedcv", repeats = 5,
+                    saveDetails = TRUE,
+                    index = index,
+                    returnResamp = "final")
> 
> fullCtrl <- trainControl(method = "repeatedcv",
+                          repeats = 5,
+                          summaryFunction = fiveStats,
+                          classProbs = TRUE,
+                          index = index)
> 
> ## The correlation matrix of the new data
> predCor <- cor(training[, newAssays])
> 
> library(RColorBrewer)
> cols <- c(rev(brewer.pal(7, "Blues")),
+           brewer.pal(7, "Reds"))
> library(corrplot)
> corrplot(predCor,
+          order = "hclust",
+          tl.pos = "n",addgrid.col = rgb(1,1,1,.01),
+          col = colorRampPalette(cols)(51))
> 
> ## Fit a series of models with the full set of predictors
> set.seed(721)
> rfFull <- train(training[, predVars],
+                 training$Class,
+                 method = "rf",
+                 metric = "ROC",
+                 tuneGrid = data.frame(mtry = floor(sqrt(length(predVars)))),
+                 ntree = 1000,
+                 trControl = fullCtrl)
Loading required package: randomForest
randomForest 4.6-7
Type rfNews() to see new features/changes/bug fixes.
Loading required package: pROC
Loading required package: plyr
Type 'citation("pROC")' for a citation.

Attaching package: ‘pROC’

The following objects are masked from ‘package:stats’:

    cov, smooth, var

Loading required package: class
> rfFull
Random Forest 

267 samples
132 predictors
  2 classes: 'Impaired', 'Control' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times) 

Summary of sample sizes: 241, 241, 241, 240, 240, 240, ... 

Resampling results

  ROC   Sens  Spec   Accuracy  Kappa  ROC SD  Sens SD  Spec SD  Accuracy SD
  0.89  0.45  0.985  0.838     0.506  0.0674  0.173    0.0276   0.0508     
  Kappa SD
  0.187   

Tuning parameter 'mtry' was held constant at a value of 11
 
> 
> set.seed(721)
> ldaFull <- train(training[, predVars],
+                  training$Class,
+                  method = "lda",
+                  metric = "ROC",
+                  ## The 'tol' argument helps lda() know when a matrix is 
+                  ## singular. One of the predictors has values very close to 
+                  ## zero, so we lower the value to be smaller than the default
+                  ## value of 1.0e-4.
+                  tol = 1.0e-12,
+                  trControl = fullCtrl)
Loading required package: MASS

Attaching package: ‘MASS’

The following object is masked _by_ ‘.GlobalEnv’:

    genotype

> ldaFull
Linear Discriminant Analysis 

267 samples
132 predictors
  2 classes: 'Impaired', 'Control' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times) 

Summary of sample sizes: 241, 241, 241, 240, 240, 240, ... 

Resampling results

  ROC    Sens   Spec   Accuracy  Kappa  ROC SD  Sens SD  Spec SD  Accuracy SD
  0.844  0.686  0.829  0.79      0.491  0.0859  0.18     0.0819   0.0659     
  Kappa SD
  0.161   

 
> 
> set.seed(721)
> svmFull <- train(training[, predVars],
+                  training$Class,
+                  method = "svmRadial",
+                  metric = "ROC",
+                  tuneLength = 12,
+                  preProc = c("center", "scale"),
+                  trControl = fullCtrl)
Loading required package: kernlab
> svmFull
Support Vector Machines with Radial Basis Function Kernel 

267 samples
132 predictors
  2 classes: 'Impaired', 'Control' 

Pre-processing: centered, scaled 
Resampling: Cross-Validated (10 fold, repeated 5 times) 

Summary of sample sizes: 241, 241, 241, 240, 240, 240, ... 

Resampling results across tuning parameters:

  C     ROC    Sens   Spec   Accuracy  Kappa  ROC SD  Sens SD  Spec SD
  0.25  0.879  0.725  0.899  0.851     0.625  0.0806  0.157    0.0729 
  0.5   0.879  0.735  0.896  0.852     0.629  0.0806  0.158    0.0769 
  1     0.885  0.706  0.923  0.863     0.645  0.0794  0.157    0.0685 
  2     0.892  0.696  0.933  0.868     0.653  0.0766  0.163    0.0632 
  4     0.886  0.682  0.931  0.863     0.637  0.0762  0.15     0.0565 
  8     0.88   0.644  0.927  0.85      0.599  0.0764  0.145    0.0507 
  16    0.881  0.652  0.923  0.849     0.599  0.076   0.142    0.0516 
  32    0.881  0.652  0.928  0.853     0.607  0.076   0.142    0.0492 
  64    0.881  0.644  0.925  0.848     0.596  0.076   0.14     0.0518 
  128   0.881  0.642  0.921  0.844     0.588  0.076   0.137    0.0556 
  256   0.881  0.647  0.926  0.85      0.599  0.076   0.145    0.0494 
  512   0.881  0.644  0.924  0.847     0.593  0.076   0.145    0.0529 
  Accuracy SD  Kappa SD
  0.0679       0.167   
  0.067        0.163   
  0.0655       0.166   
  0.0573       0.152   
  0.0529       0.143   
  0.0498       0.137   
  0.0502       0.137   
  0.0459       0.127   
  0.0476       0.13    
  0.0491       0.131   
  0.0455       0.127   
  0.0477       0.132   

Tuning parameter 'sigma' was held constant at a value of 0.004505826
ROC was used to select the optimal model using  the largest value.
The final values used for the model were sigma = 0.00451 and C = 2. 
> 
> set.seed(721)
> nbFull <- train(training[, predVars],
+                 training$Class,
+                 method = "nb",
+                 metric = "ROC",
+                 trControl = fullCtrl)
Loading required package: klaR
> nbFull
Naive Bayes 

267 samples
132 predictors
  2 classes: 'Impaired', 'Control' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times) 

Summary of sample sizes: 241, 241, 241, 240, 240, 240, ... 

Resampling results across tuning parameters:

  usekernel  ROC    Sens   Spec   Accuracy  Kappa  ROC SD  Sens SD  Spec SD
  FALSE      0.778  0.644  0.78   0.742     0.395  0.107   0.173    0.0931 
  TRUE       0.798  0.594  0.814  0.753     0.397  0.0952  0.174    0.0971 
  Accuracy SD  Kappa SD
  0.0699       0.155   
  0.0792       0.182   

Tuning parameter 'fL' was held constant at a value of 0
ROC was used to select the optimal model using  the largest value.
The final values used for the model were fL = 0 and usekernel = TRUE. 
> 
> lrFull <- train(training[, predVars],
+                 training$Class,
+                 method = "glm",
+                 metric = "ROC",
+                 trControl = fullCtrl)
Warning messages:
1: glm.fit: algorithm did not converge 
2: glm.fit: fitted probabilities numerically 0 or 1 occurred 
> lrFull
Generalized Linear Model 

267 samples
132 predictors
  2 classes: 'Impaired', 'Control' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times) 

Summary of sample sizes: 241, 241, 241, 240, 240, 240, ... 

Resampling results

  ROC    Sens  Spec   Accuracy  Kappa  ROC SD  Sens SD  Spec SD  Accuracy SD
  0.785  0.67  0.778  0.748     0.417  0.101   0.165    0.11     0.0825     
  Kappa SD
  0.172   

 
> 
> set.seed(721)
> knnFull <- train(training[, predVars],
+                  training$Class,
+                  method = "knn",
+                  metric = "ROC",
+                  tuneLength = 20,
+                  preProc = c("center", "scale"),
+                  trControl = fullCtrl)
> knnFull
k-Nearest Neighbors 

267 samples
132 predictors
  2 classes: 'Impaired', 'Control' 

Pre-processing: centered, scaled 
Resampling: Cross-Validated (10 fold, repeated 5 times) 

Summary of sample sizes: 241, 241, 241, 240, 240, 240, ... 

Resampling results across tuning parameters:

  k   ROC    Sens    Spec   Accuracy  Kappa   ROC SD  Sens SD  Spec SD
  5   0.753  0.476   0.928  0.804     0.444   0.142   0.184    0.061  
  7   0.76   0.455   0.94   0.807     0.445   0.136   0.157    0.0585 
  9   0.788  0.391   0.963  0.806     0.414   0.107   0.157    0.0374 
  11  0.794  0.369   0.973  0.808     0.408   0.114   0.149    0.0335 
  13  0.79   0.336   0.967  0.794     0.362   0.14    0.15     0.034  
  15  0.817  0.328   0.967  0.792     0.353   0.0753  0.152    0.0411 
  17  0.821  0.298   0.979  0.793     0.338   0.0736  0.157    0.0328 
  19  0.837  0.282   0.986  0.793     0.328   0.0704  0.168    0.0253 
  21  0.847  0.265   0.985  0.788     0.307   0.0704  0.169    0.0261 
  23  0.846  0.248   0.984  0.782     0.292   0.0673  0.121    0.03   
  25  0.843  0.232   0.987  0.78      0.276   0.073   0.126    0.0229 
  27  0.846  0.212   0.989  0.776     0.258   0.0669  0.108    0.0216 
  29  0.849  0.196   0.991  0.773     0.242   0.0687  0.103    0.0201 
  31  0.847  0.182   0.988  0.767     0.221   0.0703  0.0962   0.0268 
  33  0.842  0.171   0.99   0.766     0.209   0.0721  0.107    0.0208 
  35  0.843  0.157   0.991  0.762     0.193   0.0728  0.105    0.0201 
  37  0.842  0.138   0.991  0.757     0.169   0.0705  0.102    0.02   
  39  0.837  0.121   0.995  0.756     0.154   0.0731  0.104    0.0158 
  41  0.831  0.0961  0.995  0.749     0.122   0.0738  0.0932   0.0156 
  43  0.82   0.0739  0.996  0.744     0.0939  0.107   0.0854   0.0142 
  Accuracy SD  Kappa SD
  0.0661       0.195   
  0.0581       0.166   
  0.0541       0.177   
  0.0528       0.174   
  0.0512       0.177   
  0.0517       0.178   
  0.0494       0.183   
  0.05         0.191   
  0.0488       0.186   
  0.0394       0.144   
  0.0386       0.15    
  0.0364       0.135   
  0.0342       0.129   
  0.0326       0.119   
  0.0359       0.139   
  0.0352       0.136   
  0.0328       0.128   
  0.0353       0.137   
  0.0312       0.126   
  0.0288       0.116   

ROC was used to select the optimal model using  the largest value.
The final value used for the model was k = 29. 
> 
> ## Now fit the RFE versions. To do this, the 'functions' argument of the rfe()
> ## object is modified to the appropriate functions. For more details about 
> ## these functions and their arguments, see 
> ##
> ##   http://caret.r-forge.r-project.org/featureSelection.html
> ##
> ## for more information.
> 
> 
> 
> 
> ctrl$functions <- rfFuncs
> ctrl$functions$summary <- fiveStats
> set.seed(721)
> rfRFE <- rfe(training[, predVars],
+              training$Class,
+              sizes = varSeq,
+              metric = "ROC",
+              ntree = 1000,
+              rfeControl = ctrl)
> rfRFE

Recursive feature selection

Outer resampling method: Cross-Validated (10 fold, repeated 5 times) 

Resampling performance over subset size:

 Variables    ROC   Sens   Spec Accuracy  Kappa   ROCSD SensSD  SpecSD
         1 0.8067 0.5418 0.8785   0.7867 0.4344 0.09395 0.1856 0.07326
         3 0.8590 0.6518 0.9185   0.8457 0.5929 0.08670 0.1705 0.06688
         5 0.8872 0.6521 0.9468   0.8661 0.6355 0.08310 0.1743 0.06089
         7 0.8870 0.6546 0.9446   0.8652 0.6320 0.11025 0.1929 0.05844
         9 0.8985 0.6711 0.9549   0.8771 0.6618 0.07643 0.1890 0.04689
        11 0.8956 0.6975 0.9611   0.8886 0.6954 0.10711 0.1722 0.04070
        13 0.8996 0.6696 0.9650   0.8839 0.6791 0.10124 0.1659 0.04204
        15 0.8964 0.6782 0.9608   0.8832 0.6771 0.10232 0.1881 0.04215
        17 0.8994 0.6754 0.9619   0.8832 0.6785 0.07797 0.1706 0.04231
        19 0.8965 0.6696 0.9651   0.8840 0.6779 0.07583 0.1823 0.04294
        21 0.8978 0.6450 0.9702   0.8810 0.6645 0.07639 0.1824 0.03578
        23 0.8965 0.6557 0.9651   0.8803 0.6662 0.07511 0.1791 0.04173
        25 0.8958 0.6557 0.9702   0.8841 0.6739 0.07332 0.1786 0.03436
        27 0.8965 0.6400 0.9702   0.8796 0.6599 0.07667 0.1851 0.03578
        29 0.8979 0.6261 0.9733   0.8781 0.6535 0.07581 0.1809 0.04018
        31 0.8974 0.6293 0.9723   0.8781 0.6535 0.07555 0.1869 0.03596
        33 0.8930 0.6171 0.9713   0.8743 0.6413 0.07872 0.1890 0.03988
        35 0.8920 0.6264 0.9702   0.8758 0.6483 0.07962 0.1832 0.04122
        37 0.8937 0.6039 0.9682   0.8682 0.6243 0.07697 0.1925 0.04601
        39 0.8939 0.6014 0.9713   0.8697 0.6277 0.07408 0.1859 0.04277
        41 0.8925 0.5904 0.9712   0.8668 0.6185 0.09554 0.1772 0.04048
        43 0.8908 0.5875 0.9712   0.8660 0.6162 0.09824 0.1721 0.03759
        45 0.8972 0.5764 0.9723   0.8637 0.6053 0.07124 0.1903 0.03746
        47 0.8944 0.5850 0.9713   0.8651 0.6113 0.07291 0.1964 0.04522
        49 0.8958 0.5696 0.9723   0.8616 0.5983 0.07301 0.1975 0.04412
        51 0.8955 0.5554 0.9713   0.8570 0.5839 0.07303 0.1924 0.04265
        53 0.8924 0.5575 0.9702   0.8570 0.5852 0.07102 0.1884 0.04388
        55 0.8935 0.5439 0.9702   0.8532 0.5713 0.07456 0.1904 0.03765
        57 0.8929 0.5196 0.9713   0.8472 0.5517 0.07414 0.1831 0.04032
        59 0.8937 0.5446 0.9743   0.8562 0.5802 0.07611 0.1796 0.04076
        61 0.8925 0.5411 0.9753   0.8561 0.5760 0.07465 0.1976 0.03785
        63 0.8908 0.5450 0.9732   0.8556 0.5774 0.07633 0.1877 0.03788
        65 0.8951 0.5411 0.9743   0.8554 0.5769 0.07304 0.1798 0.03489
        67 0.8965 0.5246 0.9753   0.8517 0.5622 0.07274 0.1834 0.03474
        69 0.8957 0.5196 0.9764   0.8510 0.5592 0.07228 0.1859 0.03642
        71 0.8931 0.5118 0.9754   0.8481 0.5495 0.07485 0.1854 0.03303
        73 0.8907 0.5061 0.9764   0.8473 0.5459 0.07456 0.1851 0.03611
        75 0.8951 0.5061 0.9785   0.8488 0.5498 0.07156 0.1823 0.03111
        77 0.8920 0.5004 0.9722   0.8427 0.5363 0.07632 0.1682 0.03782
        79 0.8923 0.5139 0.9744   0.8481 0.5491 0.07085 0.1950 0.03900
        81 0.8943 0.5061 0.9795   0.8496 0.5508 0.07134 0.1868 0.03097
        83 0.8932 0.4946 0.9795   0.8465 0.5432 0.07049 0.1630 0.03103
        85 0.8927 0.4832 0.9795   0.8435 0.5304 0.07044 0.1722 0.03097
        87 0.8925 0.4914 0.9764   0.8436 0.5335 0.07259 0.1742 0.03483
        89 0.8923 0.4696 0.9774   0.8383 0.5130 0.07230 0.1757 0.03289
        91 0.8916 0.4889 0.9795   0.8450 0.5362 0.07351 0.1702 0.03448
        93 0.8929 0.4693 0.9785   0.8391 0.5158 0.07389 0.1686 0.03439
        95 0.8901 0.4779 0.9826   0.8443 0.5297 0.07283 0.1749 0.02858
        97 0.8918 0.4743 0.9805   0.8420 0.5236 0.07111 0.1713 0.03081
        99 0.8942 0.4800 0.9816   0.8443 0.5296 0.07325 0.1797 0.03055
       101 0.8930 0.4800 0.9805   0.8435 0.5282 0.07164 0.1792 0.02916
       103 0.8924 0.4629 0.9816   0.8397 0.5146 0.06889 0.1634 0.02864
       105 0.8918 0.4575 0.9816   0.8382 0.5089 0.07070 0.1712 0.03055
       107 0.8918 0.4586 0.9837   0.8398 0.5133 0.06979 0.1700 0.02607
       109 0.8942 0.4746 0.9815   0.8428 0.5256 0.06719 0.1710 0.02889
       111 0.8914 0.4632 0.9805   0.8390 0.5117 0.07184 0.1786 0.02916
       113 0.8928 0.4518 0.9826   0.8375 0.5033 0.07029 0.1803 0.02646
       115 0.8935 0.4529 0.9836   0.8383 0.5058 0.06754 0.1793 0.02614
       117 0.8933 0.4464 0.9815   0.8352 0.4978 0.06930 0.1652 0.02687
       119 0.8936 0.4657 0.9857   0.8435 0.5246 0.06720 0.1615 0.02522
       121 0.8891 0.4682 0.9816   0.8412 0.5190 0.07080 0.1736 0.02864
       123 0.8926 0.4418 0.9837   0.8353 0.4965 0.06742 0.1684 0.02796
       125 0.8894 0.4436 0.9847   0.8367 0.4987 0.06941 0.1764 0.02571
       127 0.8936 0.4518 0.9847   0.8390 0.5081 0.06928 0.1708 0.02571
       129 0.8889 0.4468 0.9836   0.8367 0.5003 0.06845 0.1749 0.02614
       131 0.8934 0.4346 0.9847   0.8344 0.4912 0.07038 0.1649 0.02571
       132 0.8877 0.4379 0.9847   0.8352 0.4933 0.07298 0.1726 0.02571
 AccuracySD KappaSD Selected
    0.06711  0.1847         
    0.06934  0.1855         
    0.05988  0.1679         
    0.06634  0.1892         
    0.06103  0.1797         
    0.05451  0.1591         
    0.05200  0.1546        *
    0.05678  0.1700         
    0.05461  0.1627         
    0.05903  0.1749         
    0.05494  0.1723         
    0.05715  0.1724         
    0.05202  0.1642         
    0.05537  0.1717         
    0.05820  0.1790         
    0.05602  0.1765         
    0.05933  0.1855         
    0.05394  0.1683         
    0.05929  0.1837         
    0.06028  0.1882         
    0.05597  0.1763         
    0.05423  0.1704         
    0.05549  0.1842         
    0.06042  0.1929         
    0.05635  0.1855         
    0.05377  0.1800         
    0.05516  0.1801         
    0.05398  0.1891         
    0.05712  0.1893         
    0.04935  0.1670         
    0.05440  0.1919         
    0.05082  0.1779         
    0.05206  0.1765         
    0.05270  0.1842         
    0.05400  0.1842         
    0.05191  0.1838         
    0.05377  0.1906         
    0.05491  0.1918         
    0.05021  0.1690         
    0.05564  0.1955         
    0.05507  0.1950         
    0.04912  0.1670         
    0.04941  0.1777         
    0.05320  0.1855         
    0.04916  0.1798         
    0.05029  0.1790         
    0.04891  0.1745         
    0.04844  0.1775         
    0.05162  0.1851         
    0.05033  0.1850         
    0.05291  0.1895         
    0.04500  0.1684         
    0.05046  0.1808         
    0.05023  0.1786         
    0.04958  0.1780         
    0.05316  0.1929         
    0.05182  0.1923         
    0.05036  0.1882         
    0.04590  0.1745         
    0.04543  0.1677         
    0.05059  0.1828         
    0.04866  0.1791         
    0.05019  0.1903         
    0.05073  0.1864         
    0.04952  0.1862         
    0.04700  0.1795         
    0.04872  0.1865         

The top 5 variables (out of 13):
   Ab_42, tau, p_tau, VEGF, FAS

> 
> ctrl$functions <- ldaFuncs
> ctrl$functions$summary <- fiveStats
> 
> set.seed(721)
> ldaRFE <- rfe(training[, predVars],
+               training$Class,
+               sizes = varSeq,
+               metric = "ROC",
+               tol = 1.0e-12,
+               rfeControl = ctrl)
> ldaRFE

Recursive feature selection

Outer resampling method: Cross-Validated (10 fold, repeated 5 times) 

Resampling performance over subset size:

 Variables    ROC   Sens   Spec Accuracy  Kappa   ROCSD SensSD  SpecSD
         1 0.8483 0.6621 0.8795   0.8201 0.5385 0.08787 0.2009 0.07003
         3 0.8518 0.6243 0.8899   0.8172 0.5208 0.08546 0.2032 0.06947
         5 0.8509 0.6211 0.8979   0.8216 0.5278 0.08457 0.2038 0.06042
         7 0.8517 0.6154 0.9053   0.8255 0.5341 0.08337 0.2241 0.07148
         9 0.8513 0.6264 0.9043   0.8278 0.5424 0.08574 0.2168 0.06818
        11 0.8566 0.6318 0.9176   0.8391 0.5676 0.08869 0.2132 0.06175
        13 0.8818 0.6736 0.9311   0.8603 0.6256 0.08136 0.2104 0.06389
        15 0.8872 0.6779 0.9311   0.8617 0.6298 0.07954 0.1970 0.06091
        17 0.8900 0.6729 0.9215   0.8532 0.6129 0.07704 0.1885 0.07138
        19 0.8975 0.7050 0.9289   0.8676 0.6494 0.07703 0.1940 0.05814
        21 0.9004 0.7050 0.9299   0.8683 0.6503 0.07364 0.1971 0.05628
        23 0.9067 0.7125 0.9289   0.8698 0.6536 0.07214 0.2127 0.05903
        25 0.9109 0.7193 0.9279   0.8708 0.6589 0.06827 0.2026 0.05995
        27 0.9104 0.7350 0.9271   0.8745 0.6720 0.06855 0.1866 0.05694
        29 0.9128 0.7404 0.9322   0.8798 0.6846 0.06828 0.1834 0.05535
        31 0.9128 0.7346 0.9217   0.8706 0.6632 0.06917 0.1819 0.05352
        33 0.9157 0.7429 0.9279   0.8774 0.6790 0.06941 0.1854 0.05217
        35 0.9163 0.7407 0.9217   0.8721 0.6678 0.06746 0.1848 0.05660
        37 0.9131 0.7436 0.9187   0.8706 0.6654 0.06615 0.1861 0.05812
        39 0.9126 0.7461 0.9187   0.8714 0.6679 0.06456 0.1853 0.05899
        41 0.9149 0.7436 0.9155   0.8684 0.6610 0.06764 0.1843 0.06073
        43 0.9131 0.7486 0.9145   0.8691 0.6630 0.06749 0.1872 0.05956
        45 0.9145 0.7539 0.9094   0.8669 0.6606 0.06560 0.1719 0.05729
        47 0.9109 0.7411 0.9011   0.8572 0.6369 0.06528 0.1747 0.05511
        49 0.9119 0.7471 0.9021   0.8595 0.6426 0.06766 0.1817 0.05519
        51 0.9110 0.7471 0.9031   0.8601 0.6430 0.06583 0.1885 0.05267
        53 0.9098 0.7443 0.9043   0.8601 0.6427 0.06406 0.1934 0.06022
        55 0.9082 0.7300 0.9012   0.8541 0.6261 0.06495 0.1950 0.05753
        57 0.9075 0.7350 0.9054   0.8586 0.6367 0.06390 0.1997 0.06148
        59 0.9056 0.7357 0.9115   0.8632 0.6464 0.06710 0.1977 0.05784
        61 0.9082 0.7357 0.9095   0.8617 0.6448 0.06461 0.1885 0.06244
        63 0.9087 0.7300 0.9065   0.8579 0.6364 0.06374 0.1890 0.06966
        65 0.9073 0.7364 0.9036   0.8573 0.6360 0.06500 0.1967 0.06857
        67 0.9043 0.7411 0.9045   0.8595 0.6429 0.06666 0.1847 0.06917
        69 0.8989 0.7414 0.9005   0.8566 0.6363 0.07321 0.1916 0.07001
        71 0.8989 0.7386 0.9003   0.8557 0.6332 0.07140 0.1973 0.07053
        73 0.8980 0.7332 0.9003   0.8542 0.6301 0.07119 0.1840 0.06976
        75 0.8954 0.7354 0.8953   0.8514 0.6275 0.07105 0.1649 0.07786
        77 0.8931 0.7354 0.8899   0.8475 0.6193 0.07323 0.1623 0.07480
        79 0.8911 0.7461 0.8818   0.8445 0.6163 0.07300 0.1430 0.07030
        81 0.8878 0.7489 0.8848   0.8474 0.6235 0.06987 0.1453 0.07379
        83 0.8856 0.7382 0.8733   0.8360 0.5990 0.06906 0.1441 0.08200
        85 0.8836 0.7350 0.8766   0.8376 0.6003 0.07030 0.1485 0.07922
        87 0.8825 0.7296 0.8766   0.8362 0.5961 0.07112 0.1482 0.07726
        89 0.8831 0.7189 0.8726   0.8304 0.5801 0.07049 0.1561 0.07203
        91 0.8813 0.7293 0.8694   0.8310 0.5855 0.07322 0.1527 0.07665
        93 0.8778 0.7236 0.8755   0.8340 0.5893 0.07479 0.1585 0.07555
        95 0.8749 0.7339 0.8745   0.8362 0.5961 0.09342 0.1570 0.07480
        97 0.8827 0.7282 0.8743   0.8345 0.5922 0.07452 0.1570 0.07919
        99 0.8822 0.7371 0.8733   0.8362 0.5959 0.07307 0.1522 0.06643
       101 0.8843 0.7196 0.8765   0.8339 0.5853 0.07526 0.1620 0.06258
       103 0.8808 0.7164 0.8693   0.8278 0.5714 0.07495 0.1717 0.06534
       105 0.8787 0.7318 0.8672   0.8301 0.5805 0.07423 0.1651 0.06133
       107 0.8746 0.7096 0.8682   0.8249 0.5651 0.07805 0.1584 0.06424
       109 0.8679 0.7036 0.8673   0.8227 0.5589 0.09389 0.1616 0.06383
       111 0.8688 0.7064 0.8702   0.8257 0.5653 0.07951 0.1644 0.06510
       113 0.8635 0.7182 0.8652   0.8251 0.5678 0.08714 0.1687 0.06935
       115 0.8623 0.6993 0.8577   0.8145 0.5415 0.09186 0.1710 0.06934
       117 0.8586 0.6968 0.8516   0.8093 0.5307 0.09189 0.1724 0.06903
       119 0.8570 0.6979 0.8518   0.8099 0.5323 0.09064 0.1825 0.07745
       121 0.8581 0.7093 0.8508   0.8121 0.5403 0.08832 0.1768 0.07823
       123 0.8559 0.6957 0.8477   0.8064 0.5241 0.08573 0.1852 0.07355
       125 0.8507 0.6907 0.8404   0.7996 0.5096 0.09223 0.1859 0.07252
       127 0.8439 0.6771 0.8405   0.7959 0.4979 0.08763 0.1894 0.07237
       129 0.8418 0.6739 0.8313   0.7883 0.4827 0.08636 0.1879 0.07310
       131 0.8439 0.6857 0.8294   0.7900 0.4910 0.08593 0.1803 0.08189
       132 0.8439 0.6857 0.8294   0.7900 0.4910 0.08593 0.1803 0.08189
 AccuracySD KappaSD Selected
    0.06212  0.1693         
    0.06092  0.1694         
    0.05878  0.1723         
    0.07356  0.2082         
    0.06843  0.1960         
    0.06755  0.1931         
    0.07173  0.1998         
    0.06132  0.1738         
    0.06709  0.1779         
    0.06469  0.1773         
    0.05814  0.1630         
    0.06293  0.1778         
    0.06107  0.1678         
    0.05941  0.1600         
    0.05484  0.1485         
    0.05414  0.1480         
    0.05343  0.1499         
    0.05565  0.1534        *
    0.05921  0.1614         
    0.05946  0.1604         
    0.05870  0.1577         
    0.05789  0.1585         
    0.05406  0.1445         
    0.05679  0.1521         
    0.05793  0.1556         
    0.05796  0.1586         
    0.05942  0.1618         
    0.06066  0.1687         
    0.06102  0.1704         
    0.05765  0.1624         
    0.05590  0.1549         
    0.05753  0.1524         
    0.06565  0.1722         
    0.06434  0.1671         
    0.06391  0.1670         
    0.06698  0.1765         
    0.06320  0.1639         
    0.06074  0.1485         
    0.06176  0.1516         
    0.05911  0.1417         
    0.06292  0.1510         
    0.06461  0.1510         
    0.05977  0.1408         
    0.06272  0.1476         
    0.06231  0.1524         
    0.06917  0.1646         
    0.06792  0.1640         
    0.06741  0.1630         
    0.07049  0.1658         
    0.06348  0.1552         
    0.06082  0.1547         
    0.06131  0.1588         
    0.05867  0.1505         
    0.05767  0.1460         
    0.06143  0.1582         
    0.05859  0.1501         
    0.06193  0.1546         
    0.06218  0.1573         
    0.06459  0.1642         
    0.07084  0.1784         
    0.06828  0.1704         
    0.07060  0.1793         
    0.06999  0.1790         
    0.06957  0.1782         
    0.07027  0.1782         
    0.06589  0.1611         
    0.06589  0.1611         

The top 5 variables (out of 35):
   Ab_42, tau, p_tau, MMP10, MIF

> 
> ctrl$functions <- nbFuncs
> ctrl$functions$summary <- fiveStats
> set.seed(721)
> nbRFE <- rfe(training[, predVars],
+               training$Class,
+               sizes = varSeq,
+               metric = "ROC",
+               rfeControl = ctrl)
> nbRFE

Recursive feature selection

Outer resampling method: Cross-Validated (10 fold, repeated 5 times) 

Resampling performance over subset size:

 Variables    ROC   Sens   Spec Accuracy  Kappa   ROCSD SensSD  SpecSD
         1 0.8219 0.6286 0.8806   0.8112 0.5133 0.09390 0.1858 0.07246
         3 0.8260 0.6171 0.8537   0.7886 0.4655 0.08952 0.1996 0.08506
         5 0.8176 0.6200 0.8374   0.7774 0.4472 0.08568 0.1868 0.08760
         7 0.8171 0.6107 0.8355   0.7737 0.4368 0.08333 0.1784 0.08128
         9 0.8152 0.6093 0.8274   0.7672 0.4248 0.08766 0.1798 0.08402
        11 0.8197 0.6143 0.8325   0.7723 0.4370 0.08881 0.1644 0.08098
        13 0.8264 0.6532 0.8348   0.7845 0.4720 0.08559 0.1782 0.08413
        15 0.8274 0.6582 0.8325   0.7844 0.4725 0.08184 0.1732 0.07366
        17 0.8318 0.6807 0.8387   0.7950 0.5000 0.08452 0.1690 0.07622
        19 0.8314 0.6671 0.8437   0.7948 0.4955 0.08804 0.1804 0.08144
        21 0.8294 0.6589 0.8426   0.7918 0.4866 0.08704 0.1847 0.08180
        23 0.8275 0.6457 0.8457   0.7904 0.4788 0.09091 0.1952 0.08045
        25 0.8280 0.6436 0.8404   0.7859 0.4697 0.09197 0.1937 0.07888
        27 0.8307 0.6436 0.8456   0.7896 0.4766 0.09182 0.1942 0.07845
        29 0.8291 0.6300 0.8446   0.7852 0.4643 0.09237 0.1952 0.08216
        31 0.8229 0.6182 0.8416   0.7799 0.4508 0.09497 0.1859 0.08083
        33 0.8222 0.6182 0.8386   0.7777 0.4481 0.08826 0.1859 0.08690
        35 0.8185 0.6264 0.8345   0.7769 0.4487 0.09244 0.1806 0.08364
        37 0.8165 0.6243 0.8344   0.7761 0.4454 0.09084 0.1894 0.08191
        39 0.8147 0.6214 0.8324   0.7740 0.4414 0.09174 0.1928 0.08403
        41 0.8113 0.6139 0.8244   0.7659 0.4251 0.09145 0.1896 0.08912
        43 0.8106 0.6111 0.8264   0.7667 0.4251 0.08928 0.1869 0.08561
        45 0.8078 0.6025 0.8212   0.7606 0.4105 0.09236 0.1962 0.08997
        47 0.8031 0.5971 0.8191   0.7576 0.4035 0.09325 0.1960 0.09122
        49 0.8006 0.6021 0.8169   0.7574 0.4048 0.09371 0.1918 0.08948
        51 0.7942 0.5993 0.8096   0.7514 0.3923 0.09367 0.1954 0.08918
        53 0.7942 0.6021 0.8067   0.7500 0.3922 0.09352 0.1929 0.09279
        55 0.7924 0.6025 0.8047   0.7486 0.3897 0.09154 0.1962 0.09277
        57 0.7910 0.5968 0.8037   0.7463 0.3835 0.09229 0.1991 0.09369
        59 0.7905 0.5939 0.8016   0.7441 0.3782 0.09206 0.1984 0.09330
        61 0.7885 0.6054 0.8005   0.7463 0.3872 0.09605 0.1925 0.09374
        63 0.7856 0.6025 0.8035   0.7477 0.3882 0.09639 0.1929 0.09137
        65 0.7853 0.5993 0.7953   0.7409 0.3750 0.09680 0.1954 0.09268
        67 0.7839 0.5996 0.7984   0.7432 0.3796 0.09714 0.1934 0.09261
        69 0.7824 0.5943 0.7994   0.7425 0.3760 0.09728 0.1898 0.09267
        71 0.7787 0.5996 0.7973   0.7425 0.3792 0.10154 0.1845 0.09806
        73 0.7791 0.6025 0.7973   0.7432 0.3809 0.10094 0.1851 0.09763
        75 0.7794 0.5996 0.7942   0.7402 0.3745 0.10232 0.1901 0.09662
        77 0.7792 0.6018 0.7973   0.7432 0.3811 0.10076 0.1810 0.09632
        79 0.7786 0.6100 0.7972   0.7453 0.3875 0.10145 0.1815 0.09363
        81 0.7783 0.6150 0.7973   0.7469 0.3928 0.10362 0.1801 0.09841
        83 0.7785 0.6100 0.7953   0.7440 0.3859 0.10308 0.1833 0.09937
        85 0.7799 0.6043 0.7953   0.7424 0.3807 0.10384 0.1844 0.09601
        87 0.7798 0.6096 0.7984   0.7462 0.3895 0.10427 0.1826 0.09587
        89 0.7796 0.6096 0.7953   0.7439 0.3859 0.10255 0.1826 0.09853
        91 0.7803 0.6043 0.7974   0.7439 0.3838 0.10025 0.1850 0.09901
        93 0.7813 0.6071 0.8025   0.7484 0.3926 0.09988 0.1861 0.09907
        95 0.7819 0.6046 0.8014   0.7469 0.3886 0.09946 0.1900 0.09792
        97 0.7841 0.5989 0.8025   0.7462 0.3852 0.09933 0.1877 0.09702
        99 0.7844 0.5986 0.8025   0.7462 0.3862 0.09856 0.1808 0.09932
       101 0.7856 0.5982 0.8066   0.7492 0.3906 0.09764 0.1842 0.09638
       103 0.7865 0.6007 0.8066   0.7499 0.3934 0.09868 0.1808 0.09638
       105 0.7880 0.6032 0.8097   0.7529 0.3997 0.09868 0.1785 0.09609
       107 0.7881 0.5982 0.8046   0.7476 0.3880 0.09737 0.1876 0.09781
       109 0.7909 0.5954 0.8026   0.7454 0.3825 0.09565 0.1869 0.09620
       111 0.7898 0.5929 0.8036   0.7454 0.3814 0.09557 0.1885 0.09590
       113 0.7914 0.5954 0.8057   0.7476 0.3865 0.09535 0.1864 0.09604
       115 0.7939 0.5982 0.8108   0.7522 0.3961 0.09499 0.1826 0.09499
       117 0.7969 0.5986 0.8118   0.7530 0.3980 0.09534 0.1855 0.09748
       119 0.7948 0.6039 0.8108   0.7537 0.4019 0.09458 0.1827 0.09978
       121 0.7962 0.5986 0.8118   0.7529 0.3990 0.09327 0.1725 0.09878
       123 0.7993 0.5986 0.8108   0.7522 0.3978 0.09368 0.1725 0.09916
       125 0.7999 0.6039 0.8108   0.7537 0.4020 0.09421 0.1733 0.09732
       127 0.7987 0.6014 0.8118   0.7538 0.4014 0.09424 0.1683 0.09737
       129 0.7968 0.6014 0.8108   0.7530 0.4001 0.09664 0.1683 0.09732
       131 0.7980 0.5936 0.8139   0.7530 0.3966 0.09522 0.1742 0.09706
       132 0.7980 0.5936 0.8139   0.7530 0.3966 0.09522 0.1742 0.09706
 AccuracySD KappaSD Selected
    0.05981  0.1591         
    0.07148  0.1822         
    0.06939  0.1721         
    0.06715  0.1668         
    0.06916  0.1679         
    0.06712  0.1629         
    0.07135  0.1725         
    0.07058  0.1762         
    0.06595  0.1587        *
    0.07284  0.1747         
    0.07193  0.1740         
    0.07296  0.1821         
    0.07436  0.1849         
    0.07286  0.1808         
    0.07526  0.1843         
    0.07277  0.1781         
    0.07880  0.1876         
    0.07058  0.1718         
    0.07097  0.1773         
    0.08000  0.1952         
    0.07929  0.1885         
    0.07735  0.1855         
    0.07910  0.1935         
    0.08068  0.1953         
    0.07778  0.1875         
    0.07609  0.1850         
    0.07929  0.1904         
    0.08009  0.1924         
    0.08398  0.2005         
    0.08018  0.1921         
    0.08105  0.1910         
    0.07906  0.1878         
    0.07920  0.1881         
    0.08124  0.1907         
    0.07596  0.1770         
    0.07690  0.1735         
    0.07299  0.1650         
    0.07609  0.1751         
    0.07310  0.1658         
    0.07230  0.1677         
    0.07457  0.1679         
    0.07593  0.1697         
    0.07698  0.1750         
    0.07872  0.1803         
    0.07928  0.1801         
    0.07925  0.1805         
    0.07853  0.1784         
    0.08007  0.1844         
    0.08021  0.1863         
    0.08105  0.1845         
    0.08156  0.1900         
    0.08351  0.1923         
    0.08267  0.1898         
    0.08395  0.1959         
    0.08036  0.1882         
    0.08116  0.1892         
    0.08057  0.1877         
    0.08211  0.1924         
    0.08577  0.2000         
    0.08526  0.1957         
    0.08264  0.1883         
    0.08220  0.1868         
    0.08051  0.1835         
    0.07932  0.1797         
    0.07973  0.1809         
    0.07924  0.1821         
    0.07924  0.1821         

The top 5 variables (out of 17):
   Ab_42, tau, p_tau, MMP10, MIF

> 
> ## Here, the caretFuncs list allows a model to be tuned at each iteration
> ## of feature selection.
> 
> ctrl$functions <- caretFuncs
> ctrl$functions$summary <- fiveStats
> 
> ## This option tells train() to run its model tuning
> ## sequentially. Otherwise, there would be parallel processing at two
> ## levels, which is possible but requires W^2 workers. On our machine,
> ## it was more efficient to run only the RFE process in parallel.
> 
> cvCtrl <- trainControl(method = "cv",
+                        verboseIter = FALSE,
+                        classProbs = TRUE,
+                        allowParallel = FALSE)
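The parallel processing mentioned in the comment above presumes a registered foreach backend. A minimal sketch, not part of the original session, using the doParallel package; the worker count here is an arbitrary assumption:

```r
library(doParallel)

## Register a backend; rfe()'s outer resampling loop then runs in
## parallel, while train()'s inner tuning stays sequential because
## cvCtrl sets allowParallel = FALSE.
cl <- makeCluster(4)   # 4 workers; adjust to your machine
registerDoParallel(cl)

## ... run the rfe() calls here ...

stopCluster(cl)
```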
> 
> set.seed(721)
> svmRFE <- rfe(training[, predVars],
+               training$Class,
+               sizes = varSeq,
+               rfeControl = ctrl,
+               metric = "ROC",
+               ## Now arguments to train() are used.
+               method = "svmRadial",
+               tuneLength = 12,
+               preProc = c("center", "scale"),
+               trControl = cvCtrl)
> svmRFE

Recursive feature selection

Outer resampling method: Cross-Validated (10 fold, repeated 5 times) 

Resampling performance over subset size:

 Variables    ROC     Sens   Spec Accuracy     Kappa   ROCSD  SensSD   SpecSD
         1 0.5905 0.000000 0.9958   0.7237 -0.005263 0.09085 0.00000 0.023393
         3 0.6070 0.005357 0.9927   0.7229 -0.002947 0.08997 0.02657 0.018319
         5 0.5768 0.000000 0.9979   0.7252 -0.002887 0.09124 0.00000 0.010418
         7 0.5708 0.002500 0.9989   0.7267  0.001849 0.09078 0.01768 0.007443
         9 0.6014 0.000000 0.9969   0.7245 -0.004216 0.09225 0.00000 0.012209
        11 0.6174 0.002500 0.9917   0.7215 -0.007576 0.08716 0.01768 0.026643
        13 0.6005 0.000000 0.9959   0.7237 -0.005572 0.07947 0.00000 0.013886
        15 0.6089 0.008571 0.9806   0.7148 -0.014105 0.10215 0.03427 0.038157
        17 0.6058 0.013929 0.9826   0.7178 -0.004414 0.10541 0.04227 0.035314
        19 0.6158 0.005714 0.9929   0.7230 -0.002504 0.08359 0.04041 0.022820
        21 0.6314 0.005357 0.9855   0.7177 -0.012255 0.08684 0.02657 0.031549
        23 0.6417 0.000000 0.9897   0.7192 -0.013806 0.10571 0.00000 0.025491
        25 0.6309 0.029286 0.9846   0.7237  0.018380 0.09551 0.07095 0.040661
        27 0.6458 0.023929 0.9919   0.7275  0.019326 0.09345 0.08064 0.021203
        29 0.6488 0.054286 0.9859   0.7314  0.053679 0.09500 0.07627 0.030503
        31 0.6480 0.053929 0.9809   0.7276  0.046498 0.09630 0.07589 0.042807
        33 0.6725 0.062143 0.9601   0.7148  0.029185 0.09998 0.08970 0.059497
        35 0.7088 0.153571 0.9467   0.7298  0.123175 0.10956 0.12796 0.061985
        37 0.7022 0.168929 0.9486   0.7356  0.148679 0.10998 0.10407 0.061053
        39 0.7493 0.326786 0.9245   0.7610  0.291271 0.10438 0.14757 0.065438
        41 0.7538 0.334643 0.9267   0.7650  0.304769 0.09439 0.12942 0.057705
        43 0.7654 0.382500 0.9164   0.7703  0.336838 0.10869 0.16560 0.074757
        45 0.7888 0.448571 0.9062   0.7814  0.388675 0.08715 0.16468 0.077762
        47 0.7972 0.468929 0.9022   0.7839  0.402476 0.08719 0.16326 0.072602
        49 0.8024 0.452857 0.9137   0.7878  0.400875 0.08017 0.17391 0.067174
        51 0.8041 0.464643 0.9136   0.7909  0.415948 0.08680 0.15094 0.067200
        53 0.7975 0.452857 0.9127   0.7870  0.401976 0.08324 0.16405 0.067241
        55 0.7853 0.413214 0.9009   0.7680  0.348189 0.08213 0.14752 0.077139
        57 0.7844 0.438929 0.9135   0.7837  0.390411 0.07783 0.15230 0.070724
        59 0.7870 0.415714 0.8961   0.7644  0.337122 0.07936 0.18739 0.072832
        61 0.7980 0.450714 0.9074   0.7824  0.388683 0.08052 0.17584 0.065940
        63 0.7826 0.421786 0.9082   0.7748  0.361639 0.08169 0.17747 0.063917
        65 0.7864 0.443929 0.9002   0.7751  0.372336 0.08733 0.18343 0.070610
        67 0.7906 0.460000 0.8948   0.7756  0.384999 0.08865 0.14952 0.076753
        69 0.7866 0.434286 0.9051   0.7763  0.371835 0.08833 0.16216 0.064338
        71 0.7906 0.456071 0.9035   0.7810  0.392684 0.09026 0.14814 0.064754
        73 0.7866 0.418929 0.9075   0.7736  0.357854 0.09090 0.17686 0.065723
        75 0.7833 0.429286 0.9002   0.7711  0.361158 0.09241 0.14791 0.064189
        77 0.7918 0.420714 0.9021   0.7706  0.354054 0.09031 0.16442 0.061460
        79 0.7923 0.432500 0.9053   0.7759  0.369002 0.09370 0.16919 0.062353
        81 0.8004 0.471071 0.8983   0.7817  0.397339 0.08313 0.16998 0.064611
        83 0.8019 0.465357 0.8994   0.7808  0.392492 0.09808 0.17318 0.066776
        85 0.8168 0.515357 0.8972   0.7929  0.436919 0.07805 0.16860 0.064093
        87 0.8209 0.498214 0.8983   0.7892  0.423366 0.07254 0.16296 0.061885
        89 0.8242 0.512857 0.8982   0.7930  0.435858 0.07735 0.18215 0.063687
        91 0.8274 0.502500 0.8973   0.7893  0.425311 0.07538 0.17414 0.062171
        93 0.8262 0.497857 0.9063   0.7944  0.432279 0.08008 0.17617 0.054245
        95 0.8206 0.497143 0.9064   0.7945  0.434122 0.07848 0.15594 0.055060
        97 0.8232 0.488929 0.9105   0.7950  0.430581 0.07814 0.17154 0.055436
        99 0.8223 0.500000 0.9075   0.7959  0.437668 0.07680 0.16779 0.062664
       101 0.8218 0.504286 0.9054   0.7958  0.436402 0.07946 0.18323 0.059794
       103 0.8279 0.536429 0.9085   0.8063  0.471852 0.08184 0.18360 0.064848
       105 0.8267 0.543571 0.9014   0.8034  0.470228 0.08120 0.16639 0.065705
       107 0.8251 0.541071 0.9064   0.8063  0.472569 0.07694 0.18172 0.059117
       109 0.8268 0.551429 0.9034   0.8071  0.480250 0.07694 0.16333 0.063430
       111 0.8179 0.527143 0.9002   0.7981  0.452569 0.08383 0.17475 0.065273
       113 0.8156 0.522143 0.9025   0.7984  0.452796 0.08433 0.16804 0.067664
       115 0.8138 0.510357 0.9075   0.7989  0.447754 0.08722 0.16841 0.063067
       117 0.8131 0.528571 0.9013   0.7995  0.455319 0.08265 0.17139 0.063467
       119 0.8188 0.532857 0.9095   0.8064  0.471617 0.08475 0.16431 0.060784
       121 0.8225 0.533571 0.9044   0.8026  0.464322 0.08613 0.17599 0.068058
       123 0.8245 0.538571 0.9022   0.8026  0.466771 0.08876 0.17423 0.071140
       125 0.8815 0.680000 0.9343   0.8647  0.639461 0.08004 0.16685 0.055800
       127 0.8912 0.701786 0.9282   0.8661  0.649271 0.07635 0.16151 0.066473
       129 0.8900 0.701429 0.9302   0.8676  0.652656 0.07869 0.16370 0.066277
       131 0.8914 0.691429 0.9302   0.8646  0.643526 0.07691 0.16485 0.063667
       132 0.8893 0.674286 0.9322   0.8616  0.633022 0.07449 0.15082 0.058051
 AccuracySD KappaSD Selected
    0.02203 0.02882         
    0.01749 0.03100         
    0.01499 0.01428         
    0.01339 0.01307         
    0.01455 0.01686         
    0.02500 0.04221         
    0.01587 0.01909         
    0.02934 0.05591         
    0.02970 0.06116         
    0.01998 0.03986         
    0.02564 0.04107         
    0.02283 0.03372         
    0.03352 0.08812         
    0.02360 0.08907         
    0.02977 0.09988         
    0.03099 0.09088         
    0.04439 0.11421         
    0.04207 0.13477         
    0.04654 0.13361         
    0.05764 0.16761         
    0.05818 0.16646         
    0.06259 0.18393         
    0.06600 0.18597         
    0.06204 0.17159         
    0.05692 0.17232         
    0.06115 0.17055         
    0.06655 0.18785         
    0.06179 0.17173         
    0.06063 0.16750         
    0.06290 0.19192         
    0.05784 0.17611         
    0.06131 0.18280         
    0.06970 0.20398         
    0.06393 0.16434         
    0.06248 0.17773         
    0.06188 0.16790         
    0.06071 0.18004         
    0.05861 0.15959         
    0.06317 0.17973         
    0.06094 0.17690         
    0.06597 0.18691         
    0.06481 0.19130         
    0.05888 0.16835         
    0.05787 0.16558         
    0.07162 0.20140         
    0.06980 0.19570         
    0.06206 0.18188         
    0.05792 0.16732         
    0.05728 0.16662         
    0.06128 0.17683         
    0.06266 0.18665         
    0.06926 0.19777         
    0.06456 0.17840         
    0.06237 0.18096         
    0.06472 0.17957         
    0.06687 0.18908         
    0.06858 0.18914         
    0.06080 0.17232         
    0.06273 0.17956         
    0.06103 0.17401         
    0.06788 0.19012         
    0.06961 0.18870         
    0.06327 0.17249         
    0.05461 0.14531         
    0.06171 0.16233         
    0.05834 0.15316        *
    0.05327 0.14264         

The top 5 variables (out of 131):
   Ab_42, tau, p_tau, MMP10, MIF

> 
> ctrl$functions <- lrFuncs
> ctrl$functions$summary <- fiveStats
> 
> set.seed(721)
> lrRFE <- rfe(training[, predVars],
+                training$Class,
+                sizes = varSeq,
+                metric = "ROC",
+                rfeControl = ctrl)
> lrRFE

Recursive feature selection

Outer resampling method: Cross-Validated (10 fold, repeated 5 times) 

Resampling performance over subset size:

 Variables    ROC   Sens   Spec Accuracy  Kappa   ROCSD SensSD  SpecSD
         1 0.7600 0.3325 0.9313   0.7675 0.2868 0.13224 0.2611 0.06965
         3 0.7787 0.4489 0.9054   0.7800 0.3692 0.13636 0.2786 0.07627
         5 0.8002 0.5332 0.9148   0.8099 0.4651 0.14821 0.2762 0.07436
         7 0.8300 0.6118 0.9067   0.8258 0.5318 0.12810 0.2435 0.07093
         9 0.8497 0.6425 0.9035   0.8317 0.5561 0.10148 0.2136 0.06977
        11 0.8550 0.6589 0.9062   0.8381 0.5792 0.09568 0.1699 0.06843
        13 0.8571 0.6536 0.9053   0.8361 0.5732 0.09524 0.1703 0.06620
        15 0.8543 0.6679 0.9000   0.8361 0.5778 0.10808 0.1706 0.07023
        17 0.8529 0.6729 0.8837   0.8257 0.5564 0.10130 0.1704 0.06520
        19 0.8562 0.6696 0.8866   0.8273 0.5562 0.10051 0.1889 0.06399
        21 0.8515 0.6614 0.8826   0.8220 0.5445 0.10679 0.1867 0.06815
        23 0.8473 0.6811 0.8703   0.8183 0.5447 0.10444 0.1812 0.07585
        25 0.8523 0.6775 0.8682   0.8160 0.5386 0.10208 0.1808 0.07375
        27 0.8448 0.6775 0.8651   0.8138 0.5336 0.10287 0.1871 0.07084
        29 0.8369 0.6914 0.8621   0.8153 0.5406 0.11255 0.1918 0.07864
        31 0.8172 0.6743 0.8518   0.8032 0.5125 0.14429 0.1901 0.07863
        33 0.8239 0.6804 0.8417   0.7974 0.5029 0.10737 0.1839 0.07630
        35 0.7846 0.6850 0.8249   0.7866 0.4862 0.14152 0.1684 0.07715
        37 0.7456 0.6629 0.8212   0.7778 0.4625 0.15954 0.1755 0.07874
        39 0.7291 0.6646 0.8136   0.7732 0.4540 0.15947 0.1853 0.08543
        41 0.7472 0.6707 0.8197   0.7792 0.4659 0.13699 0.1816 0.07805
        43 0.7364 0.6468 0.8153   0.7691 0.4400 0.14810 0.1897 0.08137
        45 0.7636 0.6746 0.8003   0.7657 0.4450 0.10668 0.1683 0.09067
        47 0.7619 0.6904 0.8011   0.7706 0.4602 0.12478 0.1685 0.09794
        49 0.7720 0.6782 0.8156   0.7776 0.4673 0.11553 0.1853 0.09389
        51 0.7819 0.7029 0.8099   0.7800 0.4813 0.11128 0.1693 0.09576
        53 0.7836 0.6939 0.8213   0.7860 0.4916 0.11668 0.1542 0.09829
        55 0.7984 0.7000 0.8159   0.7838 0.4902 0.08453 0.1478 0.10211
        57 0.7741 0.6768 0.8151   0.7765 0.4683 0.12412 0.1706 0.10082
        59 0.7795 0.6657 0.8119   0.7710 0.4551 0.12299 0.1737 0.10371
        61 0.7921 0.6743 0.8189   0.7786 0.4707 0.10119 0.1800 0.09823
        63 0.7885 0.6757 0.8024   0.7674 0.4501 0.10087 0.1745 0.09314
        65 0.7939 0.6786 0.8106   0.7740 0.4637 0.10055 0.1827 0.10282
        67 0.7955 0.6511 0.8046   0.7621 0.4327 0.09315 0.1860 0.09935
        69 0.7980 0.6871 0.8036   0.7712 0.4634 0.10358 0.1645 0.10550
        71 0.7881 0.6864 0.7944   0.7645 0.4525 0.10688 0.1845 0.11695
        73 0.7837 0.6632 0.7944   0.7577 0.4294 0.10418 0.1899 0.10392
        75 0.7841 0.6668 0.7923   0.7570 0.4286 0.10367 0.1970 0.10784
        77 0.7805 0.6682 0.7961   0.7605 0.4370 0.10579 0.1826 0.11082
        79 0.7812 0.6696 0.7985   0.7628 0.4430 0.10462 0.1768 0.11188
        81 0.7837 0.6621 0.7901   0.7545 0.4259 0.09616 0.1793 0.11344
        83 0.7837 0.6486 0.7881   0.7493 0.4109 0.09257 0.1870 0.11317
        85 0.7843 0.6600 0.7858   0.7508 0.4192 0.09711 0.1624 0.11135
        87 0.7869 0.6350 0.7870   0.7447 0.3995 0.08773 0.1785 0.11444
        89 0.7912 0.6679 0.7838   0.7514 0.4236 0.08960 0.1570 0.11414
        91 0.7962 0.6764 0.7851   0.7545 0.4329 0.08569 0.1592 0.11996
        93 0.7918 0.6875 0.7828   0.7559 0.4353 0.08811 0.1699 0.10608
        95 0.7920 0.6689 0.7768   0.7463 0.4130 0.08550 0.1707 0.10948
        97 0.7834 0.6632 0.7791   0.7463 0.4117 0.09253 0.1660 0.11186
        99 0.7832 0.6657 0.7747   0.7438 0.4072 0.08899 0.1731 0.10941
       101 0.7851 0.6679 0.7778   0.7470 0.4150 0.09378 0.1672 0.11201
       103 0.7876 0.6682 0.7758   0.7455 0.4119 0.09109 0.1716 0.11139
       105 0.7872 0.6725 0.7842   0.7529 0.4282 0.09882 0.1589 0.11508
       107 0.7869 0.6775 0.7852   0.7552 0.4330 0.10293 0.1613 0.11178
       109 0.7845 0.6664 0.7841   0.7515 0.4235 0.11155 0.1595 0.11597
       111 0.7831 0.6646 0.7746   0.7440 0.4099 0.10095 0.1708 0.11756
       113 0.7830 0.6646 0.7788   0.7470 0.4131 0.09778 0.1708 0.10983
       115 0.7841 0.6643 0.7778   0.7462 0.4123 0.09882 0.1659 0.11286
       117 0.7827 0.6696 0.7819   0.7507 0.4220 0.10605 0.1594 0.10893
       119 0.7831 0.6675 0.7831   0.7508 0.4195 0.10265 0.1760 0.10406
       121 0.7848 0.6721 0.7779   0.7485 0.4188 0.10165 0.1679 0.11203
       123 0.7839 0.6675 0.7779   0.7471 0.4147 0.10471 0.1686 0.11040
       125 0.7822 0.6696 0.7779   0.7478 0.4175 0.10507 0.1696 0.11471
       127 0.7818 0.6696 0.7779   0.7479 0.4173 0.10490 0.1632 0.10984
       129 0.7825 0.6693 0.7788   0.7485 0.4179 0.10320 0.1659 0.10946
       131 0.7846 0.6696 0.7779   0.7478 0.4170 0.10057 0.1652 0.10989
       132 0.7846 0.6696 0.7779   0.7478 0.4170 0.10057 0.1652 0.10989
 AccuracySD KappaSD Selected
    0.07066  0.2617         
    0.08671  0.2829         
    0.08625  0.2690         
    0.09111  0.2633         
    0.07383  0.2043         
    0.06849  0.1733         
    0.06522  0.1667        *
    0.06775  0.1702         
    0.06406  0.1631         
    0.06541  0.1750         
    0.06900  0.1806         
    0.07172  0.1767         
    0.07361  0.1815         
    0.07640  0.1907         
    0.07496  0.1864         
    0.07783  0.1932         
    0.06920  0.1712         
    0.07455  0.1774         
    0.07537  0.1831         
    0.07624  0.1800         
    0.06610  0.1632         
    0.07209  0.1740         
    0.05611  0.1242         
    0.06826  0.1483         
    0.06754  0.1534         
    0.06433  0.1383         
    0.06896  0.1485         
    0.06929  0.1391         
    0.07736  0.1652         
    0.07985  0.1716         
    0.08330  0.1842         
    0.08009  0.1732         
    0.08553  0.1853         
    0.08245  0.1841         
    0.07897  0.1662         
    0.09083  0.1910         
    0.07839  0.1723         
    0.07590  0.1653         
    0.07576  0.1576         
    0.08382  0.1767         
    0.08406  0.1725         
    0.08361  0.1763         
    0.07799  0.1526         
    0.08361  0.1739         
    0.07821  0.1536         
    0.08338  0.1619         
    0.07202  0.1492         
    0.07270  0.1453         
    0.07160  0.1409         
    0.07181  0.1454         
    0.08167  0.1649         
    0.07988  0.1632         
    0.08179  0.1628         
    0.08254  0.1655         
    0.08433  0.1691         
    0.09071  0.1863         
    0.08187  0.1732         
    0.08055  0.1652         
    0.08032  0.1643         
    0.08034  0.1761         
    0.08198  0.1721         
    0.08217  0.1736         
    0.08674  0.1802         
    0.08394  0.1752         
    0.08337  0.1753         
    0.08250  0.1718         
    0.08250  0.1718         

The top 5 variables (out of 13):
   tau, Cortisol, VEGF, Clusterin_Apo_J, Fetuin_A

> 
> ctrl$functions <- caretFuncs
> ctrl$functions$summary <- fiveStats
> 
> set.seed(721)
> knnRFE <- rfe(training[, predVars],
+               training$Class,
+               sizes = varSeq,
+               metric = "ROC",
+               method = "knn",
+               tuneLength = 20,
+               preProc = c("center", "scale"),
+               trControl = cvCtrl,
+               rfeControl = ctrl)
> knnRFE

Recursive feature selection

Outer resampling method: Cross-Validated (10 fold, repeated 5 times) 

Resampling performance over subset size:

 Variables    ROC     Sens   Spec Accuracy     Kappa   ROCSD  SensSD  SpecSD
         1 0.6064 0.000000 0.9979   0.7252 -0.002829 0.11620 0.00000 0.01016
         3 0.6105 0.021071 0.9813   0.7191  0.003301 0.10610 0.05503 0.03447
         5 0.6030 0.010714 0.9783   0.7139 -0.014339 0.12140 0.04462 0.04226
         7 0.6138 0.005000 0.9877   0.7193 -0.009955 0.11161 0.02474 0.02842
         9 0.6113 0.035000 0.9701   0.7147  0.006480 0.10063 0.07829 0.04468
        11 0.5891 0.000000 0.9917   0.7208 -0.010703 0.08800 0.00000 0.02841
        13 0.5949 0.000000 1.0000   0.7267  0.000000 0.10663 0.00000 0.00000
        15 0.5941 0.005000 0.9836   0.7162 -0.015025 0.10248 0.02474 0.03487
        17 0.5921 0.000000 0.9907   0.7200 -0.012532 0.09876 0.00000 0.02263
        19 0.6052 0.000000 0.9979   0.7252 -0.002718 0.10539 0.00000 0.01489
        21 0.6080 0.002857 0.9907   0.7207 -0.008502 0.10139 0.02020 0.02488
        23 0.6283 0.005357 0.9857   0.7177 -0.011870 0.10618 0.02657 0.03255
        25 0.6226 0.005357 0.9918   0.7222 -0.003806 0.09542 0.02657 0.02794
        27 0.6105 0.002857 0.9855   0.7169 -0.015451 0.10561 0.02020 0.03329
        29 0.6202 0.005357 0.9908   0.7216 -0.004856 0.12201 0.02657 0.02659
        31 0.5902 0.016786 0.9885   0.7229  0.007459 0.11531 0.04598 0.02856
        33 0.6038 0.026786 0.9848   0.7230  0.015065 0.12858 0.06146 0.03405
        35 0.6339 0.027143 0.9795   0.7191  0.009051 0.13339 0.06209 0.03964
        37 0.6154 0.048929 0.9702   0.7184  0.022169 0.13454 0.10529 0.04676
        39 0.6710 0.104643 0.9598   0.7260  0.082074 0.13048 0.10641 0.04772
        41 0.6559 0.117857 0.9694   0.7365  0.112048 0.12886 0.11595 0.04790
        43 0.6601 0.137857 0.9538   0.7306  0.114381 0.11575 0.13103 0.04776
        45 0.6602 0.112500 0.9487   0.7200  0.076461 0.13013 0.12582 0.05479
        47 0.6943 0.146429 0.9467   0.7275  0.113919 0.12200 0.12367 0.05346
        49 0.6745 0.127500 0.9508   0.7255  0.091337 0.13207 0.16291 0.06245
        51 0.7090 0.197500 0.9425   0.7382  0.164804 0.11652 0.16601 0.05713
        53 0.6945 0.164286 0.9538   0.7373  0.142690 0.12548 0.15692 0.06317
        55 0.6978 0.166786 0.9536   0.7381  0.145468 0.12016 0.15868 0.05358
        57 0.7224 0.225000 0.9301   0.7372  0.182435 0.11323 0.16346 0.07062
        59 0.7086 0.198214 0.9373   0.7353  0.161550 0.12940 0.16359 0.06431
        61 0.7131 0.173571 0.9526   0.7397  0.152089 0.14681 0.15374 0.05339
        63 0.6994 0.201786 0.9508   0.7459  0.185907 0.12576 0.15011 0.05695
        65 0.7067 0.172500 0.9517   0.7388  0.154809 0.12459 0.12248 0.05706
        67 0.7015 0.161429 0.9373   0.7251  0.115826 0.11470 0.15530 0.06475
        69 0.7096 0.178571 0.9415   0.7331  0.143170 0.11379 0.17086 0.07142
        71 0.7136 0.216786 0.9288   0.7342  0.173306 0.10437 0.15212 0.06756
        73 0.6874 0.234286 0.9269   0.7377  0.193418 0.16087 0.14728 0.07524
        75 0.7146 0.177500 0.9496   0.7389  0.159190 0.12630 0.11709 0.05001
        77 0.7189 0.200357 0.9435   0.7403  0.171891 0.14274 0.16054 0.05611
        79 0.7220 0.176786 0.9466   0.7359  0.148615 0.13065 0.15037 0.05573
        81 0.7367 0.227143 0.9597   0.7591  0.227189 0.11942 0.15589 0.04592
        83 0.7392 0.260000 0.9473   0.7597  0.251253 0.12542 0.13251 0.05247
        85 0.7319 0.218214 0.9570   0.7548  0.213702 0.13456 0.15718 0.05740
        87 0.7428 0.259643 0.9516   0.7623  0.252703 0.15221 0.16001 0.05009
        89 0.7439 0.274643 0.9352   0.7545  0.246367 0.11719 0.15914 0.05713
        91 0.7595 0.283214 0.9369   0.7583  0.257821 0.09331 0.16323 0.05074
        93 0.7409 0.256071 0.9350   0.7494  0.228272 0.11549 0.13400 0.05348
        95 0.7524 0.250714 0.9342   0.7471  0.217301 0.09978 0.15667 0.05147
        97 0.7306 0.238214 0.9353   0.7449  0.203307 0.11405 0.15938 0.04691
        99 0.7308 0.280000 0.9268   0.7500  0.241501 0.12963 0.15012 0.05106
       101 0.7265 0.255357 0.9312   0.7465  0.215970 0.10291 0.17091 0.05591
       103 0.7197 0.280714 0.9372   0.7577  0.253722 0.14222 0.17396 0.05224
       105 0.7279 0.234286 0.9436   0.7499  0.211755 0.14193 0.15926 0.05479
       107 0.7456 0.247857 0.9465   0.7554  0.233926 0.11890 0.15194 0.05653
       109 0.7507 0.255000 0.9393   0.7522  0.229079 0.09635 0.16771 0.05939
       111 0.7461 0.255000 0.9590   0.7666  0.259253 0.12612 0.14699 0.04257
       113 0.7518 0.267143 0.9447   0.7592  0.252789 0.09650 0.14879 0.06038
       115 0.7713 0.258214 0.9589   0.7672  0.263819 0.08508 0.14801 0.05072
       117 0.7702 0.243571 0.9498   0.7570  0.234860 0.09519 0.14184 0.05952
       119 0.7605 0.220357 0.9495   0.7499  0.205067 0.08238 0.15292 0.05275
       121 0.7743 0.241786 0.9517   0.7574  0.237198 0.09812 0.12561 0.05411
       123 0.7584 0.280714 0.9467   0.7644  0.268479 0.08869 0.16852 0.05705
       125 0.7942 0.339286 0.9506   0.7832  0.338318 0.07205 0.16489 0.04618
       127 0.7731 0.387857 0.9569   0.8013  0.400311 0.13220 0.16082 0.04310
       129 0.7984 0.422143 0.9548   0.8088  0.432632 0.07905 0.17562 0.05481
       131 0.7769 0.424286 0.9436   0.8014  0.418138 0.13265 0.17293 0.05624
       132 0.7769 0.402143 0.9475   0.7983  0.403517 0.11562 0.15858 0.05861
 AccuracySD KappaSD Selected
    0.01441 0.01401         
    0.03140 0.08531         
    0.03484 0.07018         
    0.02289 0.02972         
    0.03912 0.11306         
    0.02571 0.03597         
    0.01339 0.00000         
    0.02541 0.04511         
    0.02035 0.03032         
    0.01688 0.01922         
    0.02391 0.04407         
    0.02572 0.04782         
    0.02093 0.02935         
    0.02557 0.04149         
    0.02467 0.04702         
    0.02648 0.06738         
    0.02391 0.07722         
    0.03150 0.09220         
    0.03618 0.12457         
    0.03880 0.13217         
    0.04481 0.15396         
    0.05047 0.17032         
    0.05463 0.17769         
    0.03810 0.13242         
    0.04479 0.15554         
    0.04737 0.17116         
    0.04601 0.15784         
    0.04827 0.16931         
    0.05562 0.17556         
    0.05840 0.19088         
    0.04476 0.16073         
    0.05346 0.17446         
    0.04638 0.14586         
    0.04167 0.14837         
    0.06196 0.19803         
    0.05361 0.16956         
    0.06433 0.18473         
    0.04873 0.15096         
    0.05361 0.18296         
    0.04770 0.16618         
    0.05196 0.18409         
    0.05340 0.16931         
    0.05538 0.18451         
    0.05248 0.17640         
    0.05345 0.17649         
    0.05379 0.18031         
    0.04933 0.14716         
    0.05112 0.17577         
    0.04758 0.17363         
    0.05599 0.17706         
    0.04843 0.16940         
    0.05620 0.19190         
    0.05101 0.17896         
    0.05500 0.17980         
    0.05292 0.18030         
    0.04626 0.16770         
    0.04583 0.14661         
    0.05063 0.17167         
    0.05027 0.16002         
    0.04999 0.16962         
    0.04716 0.14377         
    0.05584 0.18904         
    0.04910 0.16983         
    0.04807 0.16637         
    0.06520 0.19926        *
    0.06499 0.19222         
    0.06089 0.17843         

The top 5 variables (out of 129):
   Ab_42, tau, p_tau, MMP10, MIF

> 
> ## Each of these models can be evaluated using the plot() function to see
> ## the profile across subset sizes.
> 
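The profile can also be inspected by hand. A minimal base-R sketch, separate from the original script, of locating the subset size with the best resampled ROC; the `prof` values echo the first rows of the knnRFE table above and are only a stand-in for the real `rfe` object:

```r
## Illustrative ROC-by-subset-size profile (values transcribed from the
## first rows of the printed rfe() table, not extracted from knnRFE).
prof <- data.frame(Variables = c(1, 3, 5, 7),
                   ROC       = c(0.606, 0.611, 0.603, 0.614))

## The size flagged with "*" in the output is, by default, the one that
## maximizes the chosen metric:
best <- prof$Variables[which.max(prof$ROC)]
best

## A profile plot analogous to plot(knnRFE):
## plot(prof$Variables, prof$ROC, type = "b",
##      xlab = "Number of predictors", ylab = "Resampled ROC")
```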
> ## Test set ROC results:
> rfROCfull <- roc(testing$Class,
+                  predict(rfFull, testing[,predVars], type = "prob")[,1])
> rfROCfull

Call:
roc.default(response = testing$Class, predictor = predict(rfFull,     testing[, predVars], type = "prob")[, 1])

Data: predict(rfFull, testing[, predVars], type = "prob")[, 1] in 18 controls (testing$Class Impaired) > 48 cases (testing$Class Control).
Area under the curve: 0.9034
> rfROCrfe <- roc(testing$Class,
+                 predict(rfRFE, testing[,predVars])$Impaired)
> rfROCrfe

Call:
roc.default(response = testing$Class, predictor = predict(rfRFE,     testing[, predVars])$Impaired)

Data: predict(rfRFE, testing[, predVars])$Impaired in 18 controls (testing$Class Impaired) > 48 cases (testing$Class Control).
Area under the curve: 0.8941
> 
> ldaROCfull <- roc(testing$Class,
+                   predict(ldaFull, testing[,predVars], type = "prob")[,1])
> ldaROCfull

Call:
roc.default(response = testing$Class, predictor = predict(ldaFull,     testing[, predVars], type = "prob")[, 1])

Data: predict(ldaFull, testing[, predVars], type = "prob")[, 1] in 18 controls (testing$Class Impaired) > 48 cases (testing$Class Control).
Area under the curve: 0.8981
> ldaROCrfe <- roc(testing$Class,
+                  predict(ldaRFE, testing[,predVars])$Impaired)
> ldaROCrfe

Call:
roc.default(response = testing$Class, predictor = predict(ldaRFE,     testing[, predVars])$Impaired)

Data: predict(ldaRFE, testing[, predVars])$Impaired in 18 controls (testing$Class Impaired) > 48 cases (testing$Class Control).
Area under the curve: 0.9259
> 
> nbROCfull <- roc(testing$Class,
+                   predict(nbFull, testing[,predVars], type = "prob")[,1])
There were 50 or more warnings (use warnings() to see the first 50)
> nbROCfull

Call:
roc.default(response = testing$Class, predictor = predict(nbFull,     testing[, predVars], type = "prob")[, 1])

Data: predict(nbFull, testing[, predVars], type = "prob")[, 1] in 18 controls (testing$Class Impaired) > 48 cases (testing$Class Control).
Area under the curve: 0.8287
> nbROCrfe <- roc(testing$Class,
+                  predict(nbRFE, testing[,predVars])$Impaired)
Warning message:
In FUN(1:66[[66L]], ...) :
  Numerical 0 probability for all classes with observation 22
> nbROCrfe

Call:
roc.default(response = testing$Class, predictor = predict(nbRFE,     testing[, predVars])$Impaired)

Data: predict(nbRFE, testing[, predVars])$Impaired in 18 controls (testing$Class Impaired) > 48 cases (testing$Class Control).
Area under the curve: 0.8565
> 
> svmROCfull <- roc(testing$Class,
+                   predict(svmFull, testing[,predVars], type = "prob")[,1])
> svmROCfull

Call:
roc.default(response = testing$Class, predictor = predict(svmFull,     testing[, predVars], type = "prob")[, 1])

Data: predict(svmFull, testing[, predVars], type = "prob")[, 1] in 18 controls (testing$Class Impaired) > 48 cases (testing$Class Control).
Area under the curve: 0.8727
> svmROCrfe <- roc(testing$Class,
+                  predict(svmRFE, testing[,predVars])$Impaired)
> svmROCrfe

Call:
roc.default(response = testing$Class, predictor = predict(svmRFE,     testing[, predVars])$Impaired)

Data: predict(svmRFE, testing[, predVars])$Impaired in 18 controls (testing$Class Impaired) > 48 cases (testing$Class Control).
Area under the curve: 0.8681
> 
> lrROCfull <- roc(testing$Class,
+                   predict(lrFull, testing[,predVars], type = "prob")[,1])
> lrROCfull

Call:
roc.default(response = testing$Class, predictor = predict(lrFull,     testing[, predVars], type = "prob")[, 1])

Data: predict(lrFull, testing[, predVars], type = "prob")[, 1] in 18 controls (testing$Class Impaired) > 48 cases (testing$Class Control).
Area under the curve: 0.8513
> lrROCrfe <- roc(testing$Class,
+                  predict(lrRFE, testing[,predVars])$Impaired)
> lrROCrfe

Call:
roc.default(response = testing$Class, predictor = predict(lrRFE,     testing[, predVars])$Impaired)

Data: predict(lrRFE, testing[, predVars])$Impaired in 18 controls (testing$Class Impaired) > 48 cases (testing$Class Control).
Area under the curve: 0.89
> 
> knnROCfull <- roc(testing$Class,
+                   predict(knnFull, testing[,predVars], type = "prob")[,1])
> knnROCfull

Call:
roc.default(response = testing$Class, predictor = predict(knnFull,     testing[, predVars], type = "prob")[, 1])

Data: predict(knnFull, testing[, predVars], type = "prob")[, 1] in 18 controls (testing$Class Impaired) > 48 cases (testing$Class Control).
Area under the curve: 0.8762
> knnROCrfe <- roc(testing$Class,
+                  predict(knnRFE, testing[,predVars])$Impaired)
> knnROCrfe

Call:
roc.default(response = testing$Class, predictor = predict(knnRFE,     testing[, predVars])$Impaired)

Data: predict(knnRFE, testing[, predVars])$Impaired in 18 controls (testing$Class Impaired) > 48 cases (testing$Class Control).
Area under the curve: 0.8391
> 
> 
> ## For filter methods, the sbf() function (named for Selection By Filter) is
> ## used. It has similar arguments to rfe() to control the model fitting and
> ## filtering methods. 
> 
> ## P-values are created for filtering. 
> 
> ## A set of four LDA models is fit, varying two factors: whether the p-values
> ## are Bonferroni-adjusted and whether the predictors are pre-screened for
> ## high correlations.
> 
> sbfResamp <- function(x, fun = mean)
+ {
+   x <- unlist(lapply(x$variables, length))
+   fun(x)
+ }
> sbfROC <- function(mod) auc(roc(testing$Class, predict(mod, testing)$Impaired))
> 
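`sbfResamp()` above counts how many predictors survive the filter in each resample. A self-contained illustration with a mock object (`mockSBF` is hypothetical; the real input is the fitted `sbf()` result, whose `$variables` element holds one character vector of selected predictors per resample). The function is repeated here so the snippet stands alone:

```r
## Mock of the $variables element stored by sbf(): one vector of
## selected predictor names per resample.
mockSBF <- list(variables = list(c("tau", "Ab_42"),
                                 c("tau", "Ab_42", "MIF"),
                                 c("tau")))

sbfResamp <- function(x, fun = mean)
{
  x <- unlist(lapply(x$variables, length))
  fun(x)
}

sbfResamp(mockSBF)               # mean subset size: (2 + 3 + 1) / 3 = 2
sbfResamp(mockSBF, fun = range)  # smallest and largest subsets: 1 and 3
```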
> ## This function calculates p-values using a t-test when the predictor has
> ## more than two distinct values and Fisher's Exact Test otherwise.
> 
> pScore <- function(x, y)
+   {
+     numX <- length(unique(x))
+     if(numX > 2)
+       {
+        out <- t.test(x ~ y)$p.value
+       } else {
+        out <- fisher.test(factor(x), y)$p.value
+       }
+     out
+   }
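A quick self-contained check of the branching in `pScore()`: a continuous predictor is routed to `t.test()`, while a binary one goes to `fisher.test()`. The data below are simulated solely for illustration, and the function is restated so the snippet runs on its own:

```r
pScore <- function(x, y)
{
  if (length(unique(x)) > 2) t.test(x ~ y)$p.value
  else fisher.test(factor(x), y)$p.value
}

set.seed(1)
y     <- factor(rep(c("Impaired", "Control"), each = 20))
xCont <- c(rnorm(20, mean = 1), rnorm(20, mean = 0))  # means differ by class
xBin  <- rep(0:1, 20)                                 # identical split per class

p1 <- pScore(xCont, y)  # small: the t-test detects the mean shift
p2 <- pScore(xBin, y)   # near 1: 10/10 split of 0s and 1s in both classes
c(p1, p2)
```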
> ldaWithPvalues <- ldaSBF
> ldaWithPvalues$score <- pScore
> ldaWithPvalues$summary <- fiveStats
> 
> ## Predictors are retained if their p-value is less than the completely 
> ## subjective cut-off of 0.05.
> 
> ldaWithPvalues$filter <- function (score, x, y)
+ {
+   keepers <- score <= 0.05
+   keepers
+ }
> 
> sbfCtrl <- sbfControl(method = "repeatedcv",
+                       repeats = 5,
+                       verbose = TRUE,
+                       functions = ldaWithPvalues,
+                       index = index)
> 
> rawCorr <- sbf(training[, predVars],
+                training$Class,
+                tol = 1.0e-12,
+                sbfControl = sbfCtrl)
> rawCorr

Selection By Filter

Outer resampling method: Cross-Validated (10 fold, repeated 5 times) 

Resampling performance:

    ROC   Sens   Spec Accuracy  Kappa   ROCSD SensSD  SpecSD AccuracySD KappaSD
 0.9168 0.7439 0.9136    0.867 0.6588 0.06458 0.1778 0.05973     0.0567  0.1512

Using the training set, 47 variables were selected:
   Alpha_1_Antitrypsin, Apolipoprotein_D, B_Lymphocyte_Chemoattractant_BL, Complement_3, Cortisol...

During resampling, the top 5 selected variables (out of a possible 66):
   Ab_42 (100%), age (100%), Cortisol (100%), Creatine_Kinase_MB (100%), Cystatin_C (100%)

On average, 46.1 variables were selected (min = 38, max = 57)
> 
> ldaWithPvalues$filter <- function (score, x, y)
+ {
+   score <- p.adjust(score,  "bonferroni")
+   keepers <- score <= 0.05
+   keepers
+ }
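The effect of the Bonferroni step can be seen directly: `p.adjust()` multiplies each p-value by the number of tests (capping at 1), so the 0.05 cut-off becomes much harder to pass. A small stand-alone illustration with made-up p-values:

```r
rawP <- c(0.001, 0.010, 0.040, 0.200)
adjP <- p.adjust(rawP, "bonferroni")   # each value multiplied by 4

adjP                 # 0.004 0.040 0.160 0.800
sum(rawP <= 0.05)    # 3 predictors kept on raw p-values
sum(adjP <= 0.05)    # only 2 survive the adjustment
```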
> sbfCtrl <- sbfControl(method = "repeatedcv",
+                       repeats = 5,
+                       verbose = TRUE,
+                       functions = ldaWithPvalues,
+                       index = index)
> 
> adjCorr <- sbf(training[, predVars],
+                training$Class,
+                tol = 1.0e-12,
+                sbfControl = sbfCtrl)
> adjCorr

Selection By Filter

Outer resampling method: Cross-Validated (10 fold, repeated 5 times) 

Resampling performance:

    ROC   Sens   Spec Accuracy  Kappa   ROCSD SensSD  SpecSD AccuracySD KappaSD
 0.8563 0.6443 0.9083   0.8361 0.5663 0.07646  0.201 0.06721    0.06283  0.1778

Using the training set, 17 variables were selected:
   Creatine_Kinase_MB, Eotaxin_3, FAS, GRO_alpha, IGF_BP_2...

During resampling, the top 5 selected variables (out of a possible 23):
   Ab_42 (100%), GRO_alpha (100%), MIF (100%), p_tau (100%), tau (100%)

On average, 13.5 variables were selected (min = 9, max = 19)
> 
> ldaWithPvalues$filter <- function (score, x, y)
+ {
+   keepers <- score <= 0.05
+   corrMat <- cor(x[,keepers])
+   tooHigh <- findCorrelation(corrMat, .75)
+   if(length(tooHigh) > 0) keepers[tooHigh] <- FALSE
+   keepers
+ }
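`findCorrelation()` comes from caret; its job can be approximated in base R. The greedy stand-in below is simpler than caret's actual heuristic (which drops the member of each pair with the larger mean absolute correlation) but shows the idea on synthetic data:

```r
set.seed(2)
a <- rnorm(50)
x <- data.frame(a = a,
                b = a + rnorm(50, sd = 0.1),  # nearly collinear with a
                c = rnorm(50))                # independent

corrMat <- abs(cor(x))
## Second member of each pair with |r| > .75 (upper triangle only):
pairs <- which(corrMat > 0.75 & upper.tri(corrMat), arr.ind = TRUE)
drop  <- unique(pairs[, "col"])
keep  <- setdiff(colnames(x), colnames(x)[drop])
keep   # "a" "c": b is removed as redundant with a
```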
> sbfCtrl <- sbfControl(method = "repeatedcv",
+                       repeats = 5,
+                       verbose = TRUE,
+                       functions = ldaWithPvalues,
+                       index = index)
> 
> rawNoCorr <- sbf(training[, predVars],
+                  training$Class,
+                  tol = 1.0e-12,
+                  sbfControl = sbfCtrl)
> rawNoCorr

Selection By Filter

Outer resampling method: Cross-Validated (10 fold, repeated 5 times) 

Resampling performance:

   ROC   Sens   Spec Accuracy  Kappa   ROCSD SensSD  SpecSD AccuracySD KappaSD
 0.918 0.7357 0.9125   0.8638 0.6508 0.06282 0.1787 0.06498    0.05687  0.1474

Using the training set, 45 variables were selected:
   Alpha_1_Antitrypsin, Apolipoprotein_D, B_Lymphocyte_Chemoattractant_BL, Complement_3, Cortisol...

During resampling, the top 5 selected variables (out of a possible 66):
   Ab_42 (100%), age (100%), E4 (100%), IGF_BP_2 (100%), IL_17E (100%)

On average, 44.3 variables were selected (min = 37, max = 54)
> 
> ldaWithPvalues$filter <- function (score, x, y)
+ {
+   score <- p.adjust(score,  "bonferroni")
+   keepers <- score <= 0.05
+   corrMat <- cor(x[,keepers])
+   tooHigh <- findCorrelation(corrMat, .75)
+   if(length(tooHigh) > 0) keepers[tooHigh] <- FALSE
+   keepers
+ }
> sbfCtrl <- sbfControl(method = "repeatedcv",
+                       repeats = 5,
+                       verbose = TRUE,
+                       functions = ldaWithPvalues,
+                       index = index)
> 
> adjNoCorr <- sbf(training[, predVars],
+                  training$Class,
+                  tol = 1.0e-12,
+                  sbfControl = sbfCtrl)
> adjNoCorr

Selection By Filter

Outer resampling method: Cross-Validated (10 fold, repeated 5 times) 

Resampling performance:

    ROC   Sens   Spec Accuracy  Kappa   ROCSD SensSD  SpecSD AccuracySD KappaSD
 0.8563 0.6443 0.9083   0.8361 0.5663 0.07646  0.201 0.06721    0.06283  0.1778

Using the training set, 17 variables were selected:
   Creatine_Kinase_MB, Eotaxin_3, FAS, GRO_alpha, IGF_BP_2...

During resampling, the top 5 selected variables (out of a possible 23):
   Ab_42 (100%), GRO_alpha (100%), MIF (100%), p_tau (100%), tau (100%)

On average, 13.5 variables were selected (min = 9, max = 19)
> 
> ## Filter methods test set ROC results:
> 
> sbfROC(rawCorr)
Area under the curve: 0.9178
> sbfROC(rawNoCorr)
Area under the curve: 0.9155
> sbfROC(adjCorr)
Area under the curve: 0.9259
> sbfROC(adjNoCorr)
Area under the curve: 0.9259
> 
> ## Get the resampling results for all the models
> 
> rfeResamples <- resamples(list(RF = rfRFE,
+                                "Logistic Reg." = lrRFE,
+                                "SVM" = svmRFE,
+                                "$K$--NN" = knnRFE,
+                                "N. Bayes" = nbRFE,
+                                "LDA" = ldaRFE))
> summary(rfeResamples)

Call:
summary.resamples(object = rfeResamples)

Models: RF, Logistic Reg., SVM, $K$--NN, N. Bayes, LDA 
Number of resamples: 50 

ROC 
                Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
RF            0.3714  0.8694 0.9229 0.8996  0.9611 1.0000    0
Logistic Reg. 0.6429  0.7984 0.8571 0.8571  0.9370 1.0000    0
SVM           0.7000  0.8421 0.8947 0.8914  0.9611 1.0000    0
$K$--NN       0.6283  0.7332 0.8004 0.7984  0.8709 0.9211    0
N. Bayes      0.6357  0.7759 0.8346 0.8318  0.8797 0.9925    0
LDA           0.7429  0.8716 0.9312 0.9163  0.9783 1.0000    0

Sens 
                Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
RF            0.2857  0.5714 0.7143 0.6696  0.7500 1.0000    0
Logistic Reg. 0.3750  0.5714 0.6250 0.6536  0.7411 1.0000    0
SVM           0.3750  0.5714 0.7143 0.6914  0.7500 1.0000    0
$K$--NN       0.1250  0.2857 0.4286 0.4221  0.5714 0.7143    0
N. Bayes      0.2857  0.5714 0.7143 0.6807  0.7500 1.0000    0
LDA           0.2500  0.6250 0.7143 0.7407  0.8571 1.0000    0

Spec 
                Min. 1st Qu. Median   Mean 3rd Qu. Max. NA's
RF            0.8500  0.9474 1.0000 0.9650  1.0000    1    0
Logistic Reg. 0.7000  0.8500 0.9000 0.9053  0.9474    1    0
SVM           0.7368  0.8947 0.9474 0.9302  1.0000    1    0
$K$--NN       0.7000  0.9474 0.9500 0.9548  1.0000    1    0
N. Bayes      0.6500  0.8000 0.8421 0.8387  0.8947    1    0
LDA           0.7895  0.8947 0.9000 0.9217  0.9500    1    0

Accuracy 
                Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
RF            0.7407  0.8519 0.8889 0.8839  0.9252 0.9630    0
Logistic Reg. 0.6667  0.7912 0.8462 0.8361  0.8777 0.9630    0
SVM           0.7692  0.8148 0.8519 0.8646  0.8929 1.0000    0
$K$--NN       0.6071  0.7778 0.8113 0.8088  0.8462 0.9231    0
N. Bayes      0.6538  0.7500 0.7778 0.7950  0.8462 0.9630    0
LDA           0.7407  0.8276 0.8846 0.8721  0.9231 1.0000    0

Kappa 
                 Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
RF            0.21580  0.5738 0.7027 0.6791  0.7874 0.9078    0
Logistic Reg. 0.23820  0.4717 0.5702 0.5732  0.6676 0.9078    0
SVM           0.35540  0.5408 0.6157 0.6435  0.7450 1.0000    0
$K$--NN       0.05263  0.3307 0.4348 0.4326  0.5737 0.7851    0
N. Bayes      0.21800  0.3999 0.4957 0.5000  0.6370 0.9143    0
LDA           0.28950  0.5519 0.6808 0.6678  0.7851 1.0000    0

> 
> fullResamples <- resamples(list(RF = rfFull,
+                                 "Logistic Reg." = lrFull,
+                                 "SVM" = svmFull,
+                                 "$K$--NN" = knnFull,
+                                 "N. Bayes" = nbFull,
+                                 "LDA" = ldaFull))
> summary(fullResamples)

Call:
summary.resamples(object = fullResamples)

Models: RF, Logistic Reg., SVM, $K$--NN, N. Bayes, LDA 
Number of resamples: 50 

ROC 
                Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
RF            0.7179  0.8528 0.8980 0.8904  0.9423 1.0000    0
Logistic Reg. 0.5214  0.7240 0.7951 0.7846  0.8612 0.9464    0
SVM           0.7143  0.8441 0.8938 0.8920  0.9611 1.0000    0
$K$--NN       0.7030  0.8047 0.8536 0.8494  0.9011 0.9737    0
N. Bayes      0.5263  0.7237 0.8036 0.7980  0.8690 1.0000    0
LDA           0.5357  0.7864 0.8571 0.8439  0.9059 0.9850    0

Sens 
                Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
RF            0.0000  0.3080 0.4643 0.4496  0.5714 0.7143    0
Logistic Reg. 0.1429  0.5714 0.7143 0.6696  0.7143 1.0000    0
SVM           0.2857  0.5714 0.7143 0.6964  0.7500 1.0000    0
$K$--NN       0.0000  0.1295 0.1429 0.1957  0.2857 0.4286    0
N. Bayes      0.2500  0.4464 0.5714 0.5936  0.7143 0.8750    0
LDA           0.2500  0.5714 0.7143 0.6857  0.8304 1.0000    0

Spec 
                Min. 1st Qu. Median   Mean 3rd Qu. Max. NA's
RF            0.9000  0.9625 1.0000 0.9847  1.0000    1    0
Logistic Reg. 0.4737  0.7368 0.7895 0.7779  0.8500    1    0
SVM           0.7368  0.9000 0.9474 0.9332  1.0000    1    0
$K$--NN       0.9474  1.0000 1.0000 0.9907  1.0000    1    0
N. Bayes      0.6316  0.7500 0.8000 0.8139  0.8947    1    0
LDA           0.6842  0.7500 0.8421 0.8294  0.8947    1    0

Accuracy 
                Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
RF            0.7308  0.8148 0.8462 0.8383  0.8846 0.9231    0
Logistic Reg. 0.5185  0.6952 0.7692 0.7478  0.8077 0.8889    0
SVM           0.7407  0.8462 0.8709 0.8683  0.9155 0.9630    0
$K$--NN       0.6923  0.7500 0.7692 0.7731  0.8022 0.8519    0
N. Bayes      0.5714  0.7037 0.7692 0.7530  0.8077 0.8889    0
LDA           0.6667  0.7500 0.7778 0.7900  0.8462 0.9259    0

Kappa 
                  Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
RF             0.00000  0.3878 0.5229 0.5057  0.6609 0.7851    0
Logistic Reg. -0.19800  0.3292 0.4336 0.4170  0.5098 0.7417    0
SVM            0.28950  0.5702 0.6554 0.6527  0.7788 0.9065    0
$K$--NN       -0.07216  0.1695 0.1980 0.2417  0.3573 0.5263    0
N. Bayes       0.02326  0.2863 0.4075 0.3966  0.5092 0.7235    0
LDA            0.10330  0.3741 0.4757 0.4910  0.6089 0.8224    0

> 
> filteredResamples <- resamples(list("No Adjustment, Corr Vars" = rawCorr,
+                                     "No Adjustment, No Corr Vars" = rawNoCorr,
+                                     "Bonferroni, Corr Vars" = adjCorr,
+                                     "Bonferroni, No Corr Vars" = adjNoCorr))
> summary(filteredResamples)

Call:
summary.resamples(object = filteredResamples)

Models: No Adjustment, Corr Vars, No Adjustment, No Corr Vars, Bonferroni, Corr Vars, Bonferroni, No Corr Vars 
Number of resamples: 50 

ROC 
                              Min. 1st Qu. Median   Mean 3rd Qu. Max. NA's
No Adjustment, Corr Vars    0.7714  0.8647 0.9281 0.9168  0.9768    1    0
No Adjustment, No Corr Vars 0.7786  0.8816 0.9263 0.9180  0.9759    1    0
Bonferroni, Corr Vars       0.6643  0.8239 0.8531 0.8563  0.8970    1    0
Bonferroni, No Corr Vars    0.6643  0.8239 0.8531 0.8563  0.8970    1    0

Sens 
                              Min. 1st Qu. Median   Mean 3rd Qu. Max. NA's
No Adjustment, Corr Vars    0.2500  0.5848 0.7321 0.7439  0.8571    1    0
No Adjustment, No Corr Vars 0.3750  0.5714 0.7143 0.7357  0.8571    1    0
Bonferroni, Corr Vars       0.2857  0.5000 0.6250 0.6443  0.7500    1    0
Bonferroni, No Corr Vars    0.2857  0.5000 0.6250 0.6443  0.7500    1    0

Spec 
                              Min. 1st Qu. Median   Mean 3rd Qu. Max. NA's
No Adjustment, Corr Vars    0.7895  0.8947    0.9 0.9136    0.95    1    0
No Adjustment, No Corr Vars 0.7500  0.8500    0.9 0.9125    0.95    1    0
Bonferroni, Corr Vars       0.7368  0.8500    0.9 0.9083    0.95    1    0
Bonferroni, No Corr Vars    0.7368  0.8500    0.9 0.9083    0.95    1    0

Accuracy 
                              Min. 1st Qu. Median   Mean 3rd Qu. Max. NA's
No Adjustment, Corr Vars    0.7407  0.8462 0.8571 0.8670  0.8919    1    0
No Adjustment, No Corr Vars 0.7407  0.8226 0.8519 0.8638  0.8889    1    0
Bonferroni, Corr Vars       0.7037  0.7778 0.8462 0.8361  0.8846    1    0
Bonferroni, No Corr Vars    0.7037  0.7778 0.8462 0.8361  0.8846    1    0

Kappa 
                              Min. 1st Qu. Median   Mean 3rd Qu. Max. NA's
No Adjustment, Corr Vars    0.3193  0.5705 0.6609 0.6588  0.7390    1    0
No Adjustment, No Corr Vars 0.3549  0.5702 0.6414 0.6508  0.7381    1    0
Bonferroni, Corr Vars       0.2087  0.4343 0.5766 0.5663  0.6957    1    0
Bonferroni, No Corr Vars    0.2087  0.4343 0.5766 0.5663  0.6957    1    0
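Models sharing the same resamples can also be compared formally; caret provides a `diff()` method for `resamples` objects that reports paired t-tests. A base-R sketch of the same idea, using illustrative ROC vectors (loosely matching the means above) rather than the actual 50 resampled values:

```r
set.seed(3)
rocRaw <- pmin(rnorm(50, mean = 0.92, sd = 0.05), 1)  # stand-in: no adjustment
rocAdj <- pmin(rnorm(50, mean = 0.86, sd = 0.07), 1)  # stand-in: Bonferroni

## Paired comparison over the 50 common resamples:
pairedP <- t.test(rocRaw, rocAdj, paired = TRUE)$p.value
pairedP  # small: the gap in mean ROC is unlikely to be resampling noise
```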

> 
> sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] klaR_0.6-7                      kernlab_0.9-16                 
 [3] MASS_7.3-26                     e1071_1.6-1                    
 [5] class_7.3-7                     pROC_1.5.4                     
 [7] plyr_1.8                        randomForest_4.6-7             
 [9] corrplot_0.71                   RColorBrewer_1.0-5             
[11] doMC_1.3.0                      iterators_1.0.6                
[13] foreach_1.4.0                   caret_6.0-22                   
[15] ggplot2_0.9.3.1                 lattice_0.20-15                
[17] AppliedPredictiveModeling_1.1-5

loaded via a namespace (and not attached):
 [1] car_2.0-16       codetools_0.2-8  colorspace_1.2-1 compiler_3.0.1  
 [5] CORElearn_0.9.41 dichromat_2.0-0  digest_0.6.3     grid_3.0.1      
 [9] gtable_0.1.2     labeling_0.1     munsell_0.4      proto_0.3-10    
[13] reshape2_1.2.2   scales_0.2.3     stringr_0.6.2    tools_3.0.1     
> 
> 
> 
> proc.time()
      user     system    elapsed 
257587.585   7078.267  35323.717 
In [ ]:
%%R -w 600 -h 600

## runChapterScript(19)

##       user     system    elapsed 
## 257587.585   7078.267  35323.717